Guy (>):
> I may have asked them why they did not map (A,C,G,T) -> (0,1,2,3) but
> since then, I've learned more about what GC-content implies in terms of
> chemistry -- it also seems to have evolutionary implications, about
> which I know nothing.

With this I can help at least, being schooled in molecular biology. People who don't care much about biochemistry, feel free to ignore this post, which is admittedly not about Perl 6.

DNA is ultimately turned into proteins, which make up our bodies. The "genetic code" is a hash table from all possible triplets of (A, C, G, T) -- so, 64 possible triplets -- to the 20 amino acids that chain up to make proteins, plus a "stop" signal, so 21 possible values in all. 21 is smaller than 64, so there's "redundancy" in the genetic code (mathematicians would call this a "non-injective mapping").

Since several triplets may map to the same amino acid, there is some "wiggle room" in the choice of bases. (Furthermore, some amino acids are chemically quite similar, giving even more potential wiggle room.)

Now, let's say you're a thermophilic bacterium living around the hot springs of Iceland. Your DNA is under a lot of stress from the heat, and the bonds between the two strands break up all the time. Something needs to be done; stable DNA is important. You decide -- well, natural selection pushes you as a group, really -- to favour GC base pairs over AT base pairs, because a GC pair is held together by three hydrogen bonds whereas an AT pair has only two. You're constrained by the proteins you want to produce, but the wiggle room allows you to favour GC pairs.

Higher GC content -> more hydrogen bonds -> sturdier DNA -> better survival.

Hopefully that also explains why the mapping in the algorithm can be GC -> 1 and AT -> 0. From the viewpoint of hydrogen bonds, only this simpler mapping matters.

<http://en.wikipedia.org/wiki/Genetic_code>
<http://en.wikipedia.org/wiki/GC-content>
<http://en.wikipedia.org/wiki/Injective_function>
<http://en.wikipedia.org/wiki/Thermophile>

Hope that helps,
// Carl
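
P.S. For those who would rather read code than chemistry, here is a minimal Python sketch of the GC -> 1, AT -> 0 mapping. It is only an illustration, not the algorithm from the original thread: the names GC_BIT, gc_bits, gc_content and hydrogen_bonds are made up for this example. The six leucine codons come from the standard genetic code and show the wiggle room: same amino acid, different number of hydrogen bonds.

```python
# Hypothetical helper names; a sketch, not the algorithm discussed in the thread.

# 1 for G or C (three hydrogen bonds when paired),
# 0 for A or T (two hydrogen bonds when paired).
GC_BIT = {"A": 0, "C": 1, "G": 1, "T": 0}


def gc_bits(seq):
    """Return the sequence as a list of 0/1 values, one per base."""
    return [GC_BIT[base] for base in seq.upper()]


def gc_content(seq):
    """Return the fraction of bases that are G or C."""
    bits = gc_bits(seq)
    if not bits:
        raise ValueError("empty sequence has no GC content")
    return sum(bits) / len(bits)


def hydrogen_bonds(seq):
    """Count hydrogen bonds between this strand and its complement."""
    bits = gc_bits(seq)
    # Every base pair has two bonds; each G or C pair adds a third.
    return 2 * len(bits) + sum(bits)


# The six codons that encode leucine in the standard genetic code.
LEUCINE_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

for codon in LEUCINE_CODONS:
    print(codon, gc_bits(codon), hydrogen_bonds(codon))
```

All six codons give leucine, but TTA pairs with 6 hydrogen bonds while CTC and CTG pair with 8, so a heat-stressed organism has room to pick the sturdier spelling without changing the protein.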