Hi
Apologies for the delay in replying.
The informational entropy algorithm is quite simple.
For each of the 64 possible triplets, count its occurrences fXXX
within your sliding window of size n. The frequencies are then:
pXXX = fXXX/(n/3)
where n/3 is of course the number of triplets in the window.
So now you have a table with 64 rows, something like:
AAA 0.03
AAC 0.01
AAG 0.02
and so on. Each of these is pXXX. Then calculate:
eXXX = -pXXX * log2(pXXX)
then sum over all 64 triplets: H = sigma(eXXX). This is the Shannon
entropy of the sequence
in that frame in that window. Now slide the window and plot how the
value changes.
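Put together, the whole procedure might look like the sketch below.
This is not the Artemis implementation - the class name, window size,
step, and sequence are illustrative choices of mine - but it follows
the steps above: count triplets in frame, convert counts to
frequencies, accumulate -p * log2(p), then slide the window.

```java
import java.util.HashMap;
import java.util.Map;

public class TripletEntropy {

    // Shannon entropy (base 2) of the triplet distribution in seq,
    // read in frame 0 (positions 0, 3, 6, ...).
    static double entropy(String seq) {
        int triplets = seq.length() / 3;  // n/3 triplets in the window
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= seq.length(); i += 3) {
            counts.merge(seq.substring(i, i + 3), 1, Integer::sum);
        }
        double ent = 0.0;
        for (int c : counts.values()) {
            double freq = (double) c / triplets;  // pXXX = fXXX/(n/3)
            ent -= freq * Math.log(freq) / Math.log(2); // -p * log2(p)
        }
        return ent;
    }

    public static void main(String[] args) {
        // Slide the window along the sequence and print the profile.
        String seq = "ACGTACGTAAATTTCCCGGGACGTACGTACGT";
        int window = 12;  // window size n, a multiple of 3
        int step = 3;     // slide by one triplet to stay in frame
        for (int start = 0; start + window <= seq.length(); start += step) {
            double h = entropy(seq.substring(start, start + window));
            System.out.println(start + "\t" + h);
        }
    }
}
```

A homopolymer run scores 0 (one triplet, p = 1), and a window in
which every triplet is distinct scores the maximum for that window
size, as expected.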
In the Java code, it works as follows:
ent -= freq*Math.log(freq)/Math.log(2); // H = -sigma(p * log2(p))
Math.log in Java is the natural logarithm (base e), so you get a log
base 2 by dividing by the natural log of 2.
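A quick way to convince yourself of the base conversion - the class
and method names here are just for illustration:

```java
public class Log2Check {
    static double log2(double x) {
        // Math.log is the natural log; dividing by ln(2) converts
        // it to log base 2.
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        // log2(64) is 6: the entropy of a sequence in which all
        // 64 triplets are equally likely.
        System.out.println(log2(64.0));
    }
}
```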
Entropy is a measure of the disorder of the sequence. Coding
sequences and repeat sequences score lower than random sequence,
which in this case will score 6 (log2 of 64 equally likely
triplets). "Typical" scores for chromosomal sequences - I'm looking
at the herpes simplex 1 genome - are in the region of 4.5 to 5.8,
depending on the window size. So it is a feature
detector of sorts. It is partly a gene-detector, although you are in
danger of confusing non-coding repeats with coding sequences, so if
used as a gene-detector, always back it up with another
gene-detecting algorithm.
cheers
Derek
_______________________________________________
Artemis-users mailing list
[email protected]
http://lists.sanger.ac.uk/mailman/listinfo/artemis-users