Hi

Apologies for the delay in replying.

The informational entropy algorithm is quite simple.

For each of the 64 possible triplets, count its occurrences (call the count fXXX) within your sliding window of size n. The frequency of each triplet is then:

pXXX = fXXX/(n/3)

n/3 is of course the number of non-overlapping triplets in the window.
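As a rough sketch of that counting step (class and method names here are mine, purely illustrative, not Artemis's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class TripletFreq {
    // Count non-overlapping triplets in the window and convert counts to
    // frequencies: pXXX = fXXX / (n/3). Assumes the window length is a
    // multiple of 3 and that we stay in a single reading frame.
    public static Map<String, Double> frequencies(String window) {
        Map<String, Integer> counts = new HashMap<>();
        int nTriplets = window.length() / 3;
        for (int i = 0; i + 3 <= window.length(); i += 3) {
            String triplet = window.substring(i, i + 3);
            counts.merge(triplet, 1, Integer::sum);
        }
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), e.getValue() / (double) nTriplets);
        }
        return freqs;
    }
}
```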

So now you have a table with 64 rows, something like:

AAA  0.03
AAC  0.01
AAG  0.02

and so on.  Each of these is pXXX.  Then calculate:

eXXX = -pXXX * log2(pXXX)

then sum over all 64 triplets: H = sigma(eXXX). This is the Shannon entropy of the sequence, in that frame, in that window. Now slide the window along and plot how the value changes.
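The summation step could be sketched like this (again a minimal illustration under my own names, not the actual Artemis implementation; triplets with zero frequency are skipped, since p*log2(p) tends to 0 as p tends to 0):

```java
import java.util.Map;

public class ShannonEntropy {
    // H = -sum over triplets of p * log2(p), in bits.
    // Math.log is the natural log, so divide by Math.log(2) for base 2.
    public static double entropy(Map<String, Double> freqs) {
        double h = 0.0;
        for (double p : freqs.values()) {
            if (p > 0) {
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }
}
```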

In the Java code, it works as follows:

ent -= freq*Math.log(freq)/Math.log(2);  //  H = -sigma(p * log2(p))

since Math.log in Java is the natural log (base e), so you get a log base 2 by dividing by the natural log of 2.

Entropy is a measure of the disorder of the sequence. Coding sequences and repeat sequences score lower than random sequence, which in this case will score 6. "Typical" scores for chromosomal sequences - I'm looking at the herpes simplex 1 genome - are in the region 4.5 to 5.8, depending on the window size. So it is a feature detector of sorts. It is partly a gene-detector, although you are in danger of confusing non-coding repeats with coding sequences, so if used as a gene-detector, always back it up with another gene-detecting algorithm.
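To see where the score of 6 comes from: with all 64 triplets equally likely (p = 1/64), H = -64 * (1/64) * log2(1/64) = log2(64) = 6 bits. A quick check of that arithmetic (class and method names are mine, for illustration only):

```java
public class MaxEntropyCheck {
    // Entropy of a uniform distribution over k outcomes is log2(k),
    // the maximum possible value - here k = 64 triplets gives 6 bits.
    public static double uniformEntropy(int k) {
        double p = 1.0 / k;
        double h = 0.0;
        for (int i = 0; i < k; i++) {
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }
}
```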

cheers
Derek


_______________________________________________
Artemis-users mailing list
[email protected]
http://lists.sanger.ac.uk/mailman/listinfo/artemis-users
