These are not the best ways to do it, but they are workable solutions:
First, for each sequence in your database, build one long sequence string. Then use a for loop to slide a window the size of your search sequence along that string, comparing the window with your query at each position. Do this for every sequence in the database. It may take a few days if you need to scan a big database such as the human genome.

The other way is to elongate your short query to 17 or 21 nt (I am not sure which is the shortest length that BLAST works with) so that BLAST can search with it. That means, if you have a 15 nt oligo, you can create 4 x 4 = 16 possible 17 nt sequences by appending every combination of two bases. Such as (the appended bases are in lowercase):

AAACCCGGGC CCTTTAAaa
AAACCCGGGC CCTTTAAag
AAACCCGGGC CCTTTAAac
AAACCCGGGC CCTTTAAat
AAACCCGGGC CCTTTAAga
AAACCCGGGC CCTTTAAgg
AAACCCGGGC CCTTTAAgc
AAACCCGGGC CCTTTAAgt
AAACCCGGGC CCTTTAAca
AAACCCGGGC CCTTTAAct
AAACCCGGGC CCTTTAAcg
AAACCCGGGC CCTTTAAcc
.....

Then you run BLAST on each of them and combine the results from all 16 17-nt sequences as the hits for your 15 nt query sequence.

Hope this is useful.

Thanks,
TU

==================================
On Thu, 7 Dec 2006, Michael Thon wrote:

> Hi Yun, you might try a clustering algorithm like blastclust (single
> linkage clustering) or mcl (a.k.a tribe-mcl) or one of the others
> that exist. I can't think of any EMBOSS apps that would solve this
> problem, but maybe someone else has a better answer.
> Mike
>
>
> On Dec 7, 2006, at 2:36 PM, yun zheng wrote:
>
>> Hi,
>>
>> Are there any tools for find unique sequences from a large
>> database? Many thanks.
>>
>> I need to find unique DNA sequences from a large database. A short
>> piece is given as follows.
>>
>> >001
>> aaaagttgtgtgtgtatgacaggtt
>> >013
>> aacctgtcatacacacacaactttt
>> >289
>> gttgtgtgtgtatgacaggtt
>> >375
>> tgtgtgtatgacaggttgat
>> >319
>> tcaacctgtcatacacaca
>> >177
>> cgcagtgtgtgtatgacagg
>> >271
>> gtcctacctgtcatacacac
>> >020
>> aagacataatgtgtgtatgacag
>>
>> All these seem to be the same sequence, since BLASTN gives very small
>> e-values for their alignments.
>>
>> BLASTN 2.2.8 [Jan-05-2004]
>>
>>
>> Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
>> Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
>> "Gapped BLAST and PSI-BLAST: a new generation of protein database search
>> programs", Nucleic Acids Res. 25:3389-3402.
>>
>> Query= 001
>>          (25 letters)
>>
>> Database: drought-clustered.fa
>>            410 sequences; 8877 total letters
>>
>> Searching.done
>>
>>                                               Score     E
>> Sequences producing significant alignments:   (bits)  Value
>>
>> 013                                             50    8e-11
>> 001                                             50    8e-11
>> 289                                             42    2e-08
>> 375                                             34    5e-06
>> 319                                             34    5e-06
>> 177                                             32    2e-05
>> 271                                             30    8e-05
>> 020                                             28    3e-04
>>
>> Best regards.
>>
>> sincerely
>>
>> Zheng, Yun
>> Department of Computer Science
>> Washington University in St Louis
>> Campus Box 1045
>> 1 Brookings Drive, St Louis, MO 63130
>> _______________________________________________
>> EMBOSS mailing list
>> [email protected]
>> http://lists.open-bio.org/mailman/listinfo/emboss

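In case it helps, here is a rough Python sketch of the two workarounds described at the top of this message. It is only an illustration under several assumptions: the function names find_exact_hits and extend_query are made up for this example, the scan finds exact matches on the given strand only (no reverse complement, no mismatches), and actually running BLAST on the extended queries is left out.

```python
from itertools import product


def find_exact_hits(query, subject):
    """Slide a window of len(query) along subject and return the
    0-based start positions of exact, case-insensitive matches."""
    query = query.upper()
    subject = subject.upper()
    width = len(query)
    hits = []
    # range() is empty when the subject is shorter than the query
    for start in range(len(subject) - width + 1):
        if subject[start:start + width] == query:
            hits.append(start)
    return hits


def extend_query(query, extra=2, alphabet="acgt"):
    """Return every sequence made by appending `extra` bases to query
    (4 ** extra sequences for the four-letter DNA alphabet)."""
    return [query + "".join(tail) for tail in product(alphabet, repeat=extra)]


if __name__ == "__main__":
    # Hypothetical toy database: sequence id -> sequence string
    database = {
        "seq1": "ttAAACCCGGGCCCTTTAAgt",
        "seq2": "acgtacgt",
    }
    query = "AAACCCGGGCCCTTTAA"

    # Workaround 1: window scan over every database sequence
    for name, seq in database.items():
        print(name, find_exact_hits(query, seq))

    # Workaround 2: the 16 elongated queries to feed to BLAST one by one
    for extended in extend_query(query):
        print(extended)
```

Each elongated query would then be searched with BLAST separately and the hit lists merged, as described above. The window scan is slow in pure Python on something the size of the human genome, so treat it as a fallback rather than a recommendation.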