Mike Marchywka wrote:
> Hi,
> As I mentioned in previous posts, I'm using the Drosophila DSCAM genes for
> testing some tools. I assembled a FASTA file composed of 3 fly entries,
>
> $ cat all_fasta | grep ">"
>
>>AF260530 Drosophila melanogaster Dscam gene, complete cds.
>>DQ317106 Drosophila yakuba Dscam gene, exons 3 through 24.
>>DQ317109 Drosophila pseudoobscura Dscam gene, exons 3 through 24.
>
> and tried aligning them with clustalw, but minutes later still didn't have a
> result. I was wondering if someone could suggest a set of parameters or an
> alternative alignment tool to do a fast alignment, even if a bit sloppy. I
> had always used the slow/accurate approach and don't know what options may
> be available for faster work; these sequences are each about 50k long.
>
We have been using MUMmer3 (http://mummer.sourceforge.net) for rapid
alignment of whole genomes and of genomes against contigs, and for finding
repeats and inverted repeats in multiple sequences. MUMmer is very fast and
has nucleotide and translated-protein modes, as well as scatterplot graphical
output, so it is very good for finding regions of high identity in large
sequences and graphically highlighting areas of interest.

> In the meantime, I was able to get a satisfactory result using exact string
> matches with successively shorter and shorter strings. This approach yields
> acceptable results in under a minute and, if needed, you could segment the
> questionable areas and feed them to clustal or another tool for "better"
> alignment. It seems to be fast because it only compares sequences to a
> reference sequence (O(n*l^2), but "l" can be smaller than the sequence length
> as unique features can be found in O(l*log(l))). There are, of course, likely
> to be various pathological cases, but for sequences known to be similar it
> seems to work ok, and the indexing feature allows extraction of substrings
> with particular distributions (occurring only once in each sample, for
> example). I have aligned 2 E. coli strains in perhaps a few minutes and there
> weren't any obvious pathological results (I obviously didn't check the whole
> thing either by eye or programmatically).
>
> Others have asked about testing methods; I'd like to show how I'm going about
> this with the DSCAM example. The alignment is only one part of a more general
> interest in finding similar/different features between samples. These
> sequences, it turns out, have exon locations in the NCBI entries. So, it was
> pretty easy to check the alignments by examining the locations of the exons
> in the aligned composite. In this case, I aligned as follows,
> ...
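
For anyone who wants to experiment with the "successively shorter exact
matches" idea quoted above, here is a rough Python sketch. It is only an
illustration under stated assumptions and is not the string_test code: the
function names, the seed lengths (32, 16, 8) and the greedy collinearity
filter are all invented for this example, and it aligns one query against
one reference only.

```python
def unique_kmers(seq, k, start, end):
    """Return {kmer: position} for k-mers occurring exactly once in seq[start:end]."""
    seen = {}
    for i in range(start, end - k + 1):
        kmer = seq[i:i + k]
        # -1 marks a k-mer that has been seen more than once in this window
        seen[kmer] = i if kmer not in seen else -1
    return {kmer: pos for kmer, pos in seen.items() if pos >= 0}


def anchors_in_window(ref, qry, k, r0, r1, q0, q1):
    """Exact k-mer matches unique in both windows, filtered to a collinear chain."""
    ref_idx = unique_kmers(ref, k, r0, r1)
    qry_idx = unique_kmers(qry, k, q0, q1)
    shared = ref_idx.keys() & qry_idx.keys()
    pairs = sorted((ref_idx[m], qry_idx[m]) for m in shared)
    chain, next_r, next_q = [], r0, q0
    for r, q in pairs:
        # Greedy filter (an assumption of this sketch): keep a match only if it
        # lies after the previous kept match in both sequences, without overlap.
        if r >= next_r and q >= next_q:
            chain.append((r, q))
            next_r, next_q = r + k, q + k
    return chain


def align_by_anchors(ref, qry, lengths=(32, 16, 8)):
    """Find anchors with long seeds first, then retry the gaps with shorter seeds.

    Returns a sorted list of (ref_pos, qry_pos, seed_length) tuples.
    """
    windows = [(0, len(ref), 0, len(qry))]
    anchors = []
    for k in lengths:
        next_windows = []
        for r0, r1, q0, q1 in windows:
            prev_r, prev_q = r0, q0
            for r, q in anchors_in_window(ref, qry, k, r0, r1, q0, q1):
                anchors.append((r, q, k))
                # The stretch before this anchor becomes a gap to refine later
                next_windows.append((prev_r, r, prev_q, q))
                prev_r, prev_q = r + k, q + k
            next_windows.append((prev_r, r1, prev_q, q1))
        # Only gaps that are non-empty in both sequences can hold more anchors
        windows = [w for w in next_windows if w[1] > w[0] and w[3] > w[2]]
    return sorted(anchors)
```

Calling align_by_anchors(ref, qry) on two sequence strings gives the anchor
list; whatever gaps remain between anchors are the "questionable areas" that
could be cut out and handed to clustalw. The greedy filter can be thrown off
by a single spurious early match, so a proper chaining step would be the
first thing to replace.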
> I'm aware of the following related alignment literature, open to ideas:
>
> $ string_test -about|unix2dos >/dev/clipboard
>
> Contact: [EMAIL PROTECTED] Nov 2007
> Comment: uses some indexing to get speed up,
> Comment: motivation for RC rules from this etc ,
> Ref:http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1431710
> Commment: and should work well on text or (modified slightly ) binary code too
> Note: More code in mm_align_tool
> Note: Based loosely on references such as these but 'common sense'
> Note: seemed to work well as these are after-the-fact lookups
> Ref: http://www.google.com/search?hl=en&safe=off&q=string+alignment+site%3Aciteseer.ist.psu.edu
> Ref: http://citeseer.ist.psu.edu/csuros05rapid.html
> Comment: Csuros, M., Ma, B.: Rapid homology search with two-stage extension and
> Comment: daughter seeds. In: Proc. 11th Int. Computing and Combinatorics Conf. (COCOON).
> Comment: Volume 3595 of LNCS., Springer-Verlag (2005) 104-- 114
> Ref: http://citeseer.ist.psu.edu/468459.html
> Ref: http://citeseer.ist.psu.edu/kahveci04speeding.html
> Feb 2 2008 09:35:40 string_test.h182
>
> Thanks.
>
> Mike Marchywka
> 586 Saint James Walk
> Marietta GA 30067-7165
> 404-788-1216 (C)<- leave message
> 989-348-4796 (P)<- emergency only
> [EMAIL PROTECTED]
> Note: Hotmail is blocking my mom's entire
> ISP claiming it is to reduce spam but probably
> to force users to use hotmail. Please DON'T
> assume I am ignoring you and try
> me on [EMAIL PROTECTED] if no reply
> here. Thanks.
>

--
Larye D. Parkins
Information Engineering Services
PMB 435, 610 N. 1st St., Ste 5
Hamilton, MT 59840
http://www.info-engineering-svc.com
Making IT work since 1965.
Member of: ACM, IEEE Computer Society, USENIX, SAGE, LOPSA
_______________________________________________
BBB mailing list
[email protected]
http://www.bioinformatics.org/mailman/listinfo/bbb
