replying to myself again ... (posts appear with a lag, when the opinion may have already changed) ok, that's basically what blast is doing. I didn't know about blast. _http://en.wikipedia.org/wiki/BLAST_ (http://en.wikipedia.org/wiki/BLAST) most cited paper in the 90s ! But I wanted a small program that I understand and can easily manipulate for different formats and additional features and without copyright,licence, so I wrote my own. For the original problem (as I understood it) - I would send the blast output to a file and filter the output for representatives. There should be a tool for that, (I don't know) Mine is again selfwritten. Hmm, sequences that are 99% identical to the original can't be only 90% identical to each other, so maybe I misunderstood.
In a message dated 18.05.2012 04:17:35 Westeuropäische Normalzeit, [email protected] writes: thinking more about it ... task: --------------------------------------- given two strings of amino-acids, S1 and S2. Typically the lengths are ~100-~million bytes for S1 and ~100M-~4B bytes for S2, where e.g. S2 is the list of all bacterial amino-acid sequences from genbank, joined together. Given threshold D , find all triples(s1,s2,L) of lengths L and subsequences s1 of length L from S1 and subsequences s2 of length L from S2 such that p-value(|s1-s2|) < D . Typically this is done for several Ds simultaneously, basically trying to find the n best subsequences for some given n. Solve that task as quickly as possible. ----------------------------------- algorithm suggestion: ---------------------------------------------------- for all length=(4?) subsequences s1 from S1 make a table T[s1] of addresses where that subsequence occurs in S1 walk through all length=4 subsequences s2 of S2 (load S2 into a cyclic 256 byte-buffer) whenever s2 is marked in T (T[s2]>0), check the extended subsequences x1 bytes forward, x2 bytes backward from s2 , enter it into the best(L) lists for the various lengths L=x1+x2+4. Print those that are <tolerance --------------------------------------------------- plot the best #matches(L) (=% similarity) in a chart this misses triples (s1,s2,L) that have no 4-length 100%-match but are good in total length nevertheless ---------------------------------------------------------------- does this program exist ? What is "blast" doing ? amino acid frequencies in genbank bacterial files: (I just calculated those - just in case someone is interested) L 76 219736963 A 65 212665642 G 71 164123234 V 86 155933244 E 69 132851172 I 73 130723255 S 83 128052651 R 82 123992485 D 68 118503645 T 84 117688310 K 75 105966220 P 80 97716246 F 70 84543510 N 78 81056701 Q 81 79888596 Y 89 64269341 M 77 51573988 H 72 45239258 W 87 26873940 C 67 20419909 X 88 175383 _______________________________________________ BBB mailing list [email protected] http://www.bioinformatics.org/mailman/listinfo/bbb _______________________________________________ BBB mailing list [email protected] http://www.bioinformatics.org/mailman/listinfo/bbb
