> I may have asked them why they did not map (A,C,G,T) -> (0,1,2,3) but
> since then, I've learned more about what GC-content implies in terms of
> chemistry -- it also seems to have evolutionary implications, about
> which I know nothing.
With this I can help at least, being schooled in molecular biology.
People who don't care much about biochemistry, feel free to ignore
this post, which is admittedly not about Perl 6.
DNA ultimately gets translated (via RNA) into proteins, which make up
our bodies. The "genetic code" is a hash table from all possible
triplets of (A, C, G, T) -- so, 64 possible triplets -- to 21 outcomes:
the 20 standard amino acids that chain up to make proteins, plus a
"stop" signal. 21 is smaller than 64, so there's
"redundancy" in the genetic code (mathematicians would call this a
"non-injective mapping"). Since several triplets may map to the same
amino acid, there is some "wiggle room" in the choice of bases.
(Furthermore, some amino acids are chemically quite similar, giving
even more potential wiggle room.)
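To make the "hash table" picture concrete, here is a minimal Python sketch. It holds only a small, deliberately incomplete slice of the standard genetic code (the real table has 64 entries), and the names CODON_TABLE and synonymous_codons are made up for illustration. Inverting the table shows which triplets are interchangeable.

```python
# A small, deliberately incomplete slice of the standard genetic code,
# written in the DNA alphabet. The full table has 64 entries.
CODON_TABLE = {
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "ATG": "Met",
    "TGG": "Trp",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}

def synonymous_codons(table):
    """Invert the table: amino acid -> sorted list of codons mapping to it."""
    groups = {}
    for codon, amino_acid in table.items():
        groups.setdefault(amino_acid, []).append(codon)
    return {aa: sorted(codons) for aa, codons in groups.items()}

groups = synonymous_codons(CODON_TABLE)
print(groups["Gly"])  # four interchangeable triplets for glycine
print(groups["Met"])  # methionine has exactly one triplet, so no wiggle room
```

Glycine can be spelled four ways, so a genome is free to pick among them without changing the protein.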
Now, let's say you're a thermophilic bacterium living around the hot
springs of Iceland. Your DNA is under a lot of stress from the heat,
and the bonds break up all the time. Something needs to be done, since
stable DNA is important. You decide -- well, natural selection pushes
you as a group, really -- to favour GC pairs rather than AT pairs,
because a GC base pair is held together by three hydrogen bonds whereas
an AT pair has only two. You're constrained by the proteins you want to
produce, but the wiggle room allows you to favour GC pairs.
Higher GC content -> more
hydrogen bonds -> sturdier DNA -> better survival.
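As a rough illustration of that chain of reasoning, the sketch below counts hydrogen bonds for two triplets that both code for leucine (TTA and CTG). The function name hydrogen_bonds is hypothetical, and the count simply assumes each base sits in a regular double helix paired with its complement.

```python
# Hydrogen bonds contributed by each base when paired with its complement:
# G-C pairs have three, A-T pairs have two.
BONDS_PER_BASE = {"G": 3, "C": 3, "A": 2, "T": 2}

def hydrogen_bonds(seq):
    """Total hydrogen bonds holding this stretch of double-stranded DNA together."""
    return sum(BONDS_PER_BASE[base] for base in seq.upper())

# Both triplets code for leucine, so swapping one for the other
# leaves the protein unchanged.
at_rich = hydrogen_bonds("TTA")
gc_rich = hydrogen_bonds("CTG")
print(at_rich, gc_rich)
```

The protein is the same either way, but the GC-rich spelling is held together by more hydrogen bonds.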
Hopefully that also explains why the mapping in the algorithm can be
GC -> 1 and AT -> 0. From the viewpoint of hydrogen bonds, only this
simpler mapping matters.
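A minimal sketch of that mapping in Python, for anyone who wants to play with it. This is not the algorithm discussed earlier in the thread, just an illustration; gc_bits and gc_content are names I made up.

```python
def gc_bits(seq):
    """Map each base to 1 if it is G or C, and to 0 if it is A or T."""
    return [1 if base in "GC" else 0 for base in seq.upper()]

def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    bits = gc_bits(seq)
    return sum(bits) / len(bits)

seq = "GATTACA"
bits = gc_bits(seq)
print(bits)
print(gc_content(seq))

# Every base pair has at least two hydrogen bonds, and G/C pairs add one
# more, so the total follows from the 0/1 mapping alone.
total_bonds = 2 * len(seq) + sum(bits)
print(total_bonds)
```

The last lines show why the finer (A,C,G,T) -> (0,1,2,3) distinction adds nothing here: the bond count depends only on how many ones there are.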
Hope that helps,