Hi all, I believe I *may* have found a bug in compseq.
I have been using compseq to calculate the frequency of amino acids in translated DNA sequences. I find that frequently compseq takes the amino acid sequence to be DNA (they are sequences with an unusual composition, but then I am looking for odd proteins). So instead of the expected output for all amino acids with most being zero, I often get output for A, C, G, T and 'Other'. I cannot see an obvious pattern that would explain this behaviour, but maybe you can help.

Command line:

compseq -seq compseq_bug.in -word 1 -frame 1 -out compseq_bug.out

An example input and output file are pasted in below - I can provide many more.

It might help if the user could specify whether the input sequence is DNA or protein, rather than the program working it out somehow?

Best wishes
Anette

Here is an example of the problem:

>Seq1
GSGGGGGSGGRGMGGWGGGRGSGVGGRGWGVG

#
# Output from 'compseq'
#
# Only words in frame 1 will be counted.
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#   Seq1

Word size    1
Total count  31

#
# Word  Obs Count  Obs Frequency  Exp Frequency  Obs/Exp Frequency
#
A       0          0.0000000      0.2500000      0.0000000
C       0          0.0000000      0.2500000      0.0000000
G       20         0.6451613      0.2500000      2.5806452
T       0          0.0000000      0.2500000      0.0000000
Other   11         0.3548387      0.0000000      10000000000.0000000

Here is a similar sequence that works fine:

>Seq2
VGSEGGGGGRRGEGGGGGGRGGGGGRWEEGAG

#
# Output from 'compseq'
#
# Only words in frame 1 will be counted.
# The Expected frequencies are calculated on the (false) assumption that every
# word has equal frequency.
#
# The input sequences are:
#   Seq2

Word size    1
Total count  31

#
# Word  Obs Count  Obs Frequency  Exp Frequency  Obs/Exp Frequency
#
A       1          0.0322581      0.0476190      0.6774194
C       0          0.0000000      0.0476190      0.0000000
D       0          0.0000000      0.0476190      0.0000000
E       4          0.1290323      0.0476190      2.7096774
F       0          0.0000000      0.0476190      0.0000000
G       20         0.6451613      0.0476190      13.5483871
H       0          0.0000000      0.0476190      0.0000000
I       0          0.0000000      0.0476190      0.0000000
K       0          0.0000000      0.0476190      0.0000000
L       0          0.0000000      0.0476190      0.0000000
M       0          0.0000000      0.0476190      0.0000000
N       0          0.0000000      0.0476190      0.0000000
P       0          0.0000000      0.0476190      0.0000000
Q       0          0.0000000      0.0476190      0.0000000
R       4          0.1290323      0.0476190      2.7096774
S       1          0.0322581      0.0476190      0.6774194
T       0          0.0000000      0.0476190      0.0000000
U       0          0.0000000      0.0476190      0.0000000
V       0          0.0000000      0.0476190      0.0000000
W       1          0.0322581      0.0476190      0.6774194
Y       0          0.0000000      0.0476190      0.0000000

_______________________________________________
EMBOSS mailing list
http://lists.open-bio.org/mailman/listinfo/emboss
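
A possible explanation for the difference between Seq1 and Seq2, offered as an assumption and not as a confirmed description of how compseq decides the sequence type: every residue in Seq1 (G, S, R, M, W, V) is also a valid IUPAC nucleotide ambiguity code, whereas Seq2 contains E, which is not. The Python sketch below uses two hypothetical helpers, looks_like_nucleotide and non_nucleotide_residues, to show that a check of this kind would classify Seq1 as nucleotide and Seq2 as protein. It is not EMBOSS code.

```python
# IUPAC nucleotide alphabet: A, C, G, T, U plus the ambiguity codes.
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")


def looks_like_nucleotide(seq: str) -> bool:
    """Hypothetical heuristic: True if every letter is an IUPAC nucleotide code.

    This is an assumed stand-in for automatic type detection,
    not the check that compseq itself performs.
    """
    residues = {c for c in seq.upper() if c.isalpha()}
    return bool(residues) and residues <= IUPAC_NUCLEOTIDE


def non_nucleotide_residues(seq: str) -> list[str]:
    """Return the sorted letters in seq that are not IUPAC nucleotide codes."""
    residues = {c for c in seq.upper() if c.isalpha()}
    return sorted(residues - IUPAC_NUCLEOTIDE)


# The two sequences from the report above.
seq1 = "GSGGGGGSGGRGMGGWGGGRGSGVGGRGWGVG"
seq2 = "VGSEGGGGGRRGEGGGGGGRGGGGGRWEEGAG"

for name, seq in (("Seq1", seq1), ("Seq2", seq2)):
    print(name, looks_like_nucleotide(seq), non_nucleotide_residues(seq))
```

If this is the cause, any protein made up only of letters from that nucleotide alphabet would be affected. EMBOSS documents general sequence qualifiers such as -sprotein1 and -snucleotide1 for forcing the input type; whether they resolve this particular case is untested here, so it is worth checking the output of "compseq -help -verbose" on your installation.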
