Re: [BiO BB] time efficient global alignment algorithm

Ryan Golhar Tue, 04 Aug 2009 18:14:16 -0700

I'm trying to perform a large number of sequence alignments of long DNA
sequences, some up to 163,000+ bp in length. I was trying to use the
standard Needleman-Wunsch algorithm, but the matrix used requires a
large amount of memory... about 100 GB of memory. This obviously won't work.
How many were you trying to align? You mean 163kb or 163Mb?
I was looking for tests or comparisons for some alignment code I had,
which indexed the target sequences. I don't recall the suggestions from
that discussion, but I was able to do simple genomes reasonably well
(I think I used 2 strains of E. coli or something, about 5 megs long)
on a desktop. If you can find responses to my request from a few years
ago, that may (or may not) help. I'd offer my code, and indeed I think
I have it on a website, but I stopped development and am not sure
it is useful as-is unless you just want coarse alignment on
two similar sequences.

Hundreds of thousands. I'm trying to eliminate duplicates or near
duplicates (>90% similarity). I'm using the methodology from
cd-hit-est. However, I'm not successful in getting that application to
run on the number of sequences I have. Right now, I'm trying to cluster
the nt database; however, later I would like to cluster other sequences
from other sources.

Many implementations of just about anything are bad with
memory management; sometimes just blocking or sorting or
compacting the internal representation can make a big improvement.
Not sure what exists along these lines, but often some simplifications
don't change results but decrease time/memory spent on futile possibilities.

Agreed. However, in doing the dynamic programming matrix, you still need
to allocate an m x n matrix of ints. With sequences of 163,000 bp in
length, you need about 100 GB of RAM. Unless there is a way to use a
compact representation of the DP matrix that I'm not aware of.
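
To make the memory figure concrete, and to illustrate one well-known way
around it, here is a minimal Python sketch. It is not code from this
thread. The first function just reproduces the arithmetic (4-byte ints
are an assumption). The second computes the Needleman-Wunsch score while
keeping only two rows of the matrix, so memory grows with the shorter
sequence rather than with m x n. The scoring values are hypothetical
placeholders. Note that this returns only the score; recovering the
alignment itself in linear space needs Hirschberg's divide-and-conquer
scheme, and the running time is still proportional to m x n, so pure
Python would be far too slow at 163,000 bp.

```python
def dp_matrix_bytes(m, n, bytes_per_cell=4):
    """Memory needed for a full (m+1) x (n+1) DP matrix of fixed-width ints.

    bytes_per_cell=4 is an assumption (a typical C int).
    """
    return (m + 1) * (n + 1) * bytes_per_cell


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score using two rows instead of the full matrix.

    match/mismatch/gap are illustrative values, not taken from the thread.
    """
    if len(b) > len(a):
        a, b = b, a  # keep the shorter sequence along the row to save memory
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[len(b)]


if __name__ == "__main__":
    print(dp_matrix_bytes(163_000, 163_000) / 1e9, "GB for the full matrix")
    print(nw_score("GATTACA", "GATTACA"))
```

With two rows of 163,001 cells each, the working set is on the order of
a megabyte rather than 100 GB, at the cost of losing the traceback.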

Are all of these nominally the same or are you trying to align
noise to noise?

Yes, they are nominally the same... they have at least 50% of the
non-overlapping words of the shorter of the two sequences.
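
One possible reading of that word-sharing criterion, as a rough Python
sketch: cut the shorter sequence into non-overlapping words and count
how many of them occur anywhere in the longer sequence. The function
name, the word length of 8, and the choice to look words up among all
overlapping positions of the longer sequence are assumptions made for
illustration; the post does not specify them.

```python
def shared_word_fraction(seq_a, seq_b, word_len=8):
    """Fraction of non-overlapping words of the shorter sequence that
    also occur somewhere in the longer one.

    word_len=8 is a hypothetical default, not a value from the thread.
    """
    short, long_ = sorted((seq_a, seq_b), key=len)
    # Every word of the longer sequence, at every offset.
    long_words = {
        long_[i:i + word_len] for i in range(len(long_) - word_len + 1)
    }
    # Non-overlapping tiles of the shorter sequence.
    short_words = [
        short[i:i + word_len]
        for i in range(0, len(short) - word_len + 1, word_len)
    ]
    if not short_words:
        return 0.0  # shorter sequence has no complete word
    hits = sum(1 for w in short_words if w in long_words)
    return hits / len(short_words)


if __name__ == "__main__":
    # A pair would pass the filter described above if the value is >= 0.5.
    print(shared_word_fraction("AAAACCCC", "AAAAGGGGTTTT", word_len=4))
```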

Ideally, something with the speed similar to BLAST.
I guess in an odd way my approach could get there, as it essentially
queries each string for "interesting" short sequences, but I'd have to
check order (how many of these it uses, etc.). Last time I checked the
academic lit, IIRC this exact-string matching was an open research area;
maybe there have been advancements in the last few years that are
trivial to code or exist in an academic's lab.

If there are, I haven't heard of any. My thought was to run a BLAST
alignment on the two sequences using bl2seq. Then string together the
non-overlapping HSPs and perform a global alignment on the regions in
between the HSPs. This is easy enough, but I want to see if there is a
solution already out there first.
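
The chaining step described above could look roughly like the following
Python sketch. It assumes the HSPs have already been parsed from the
bl2seq output into tuples of (q_start, q_end, s_start, s_end, score)
with 0-based, half-open coordinates on the same strand; that tuple
layout and the function names are hypothetical. It picks the
highest-scoring set of HSPs that do not overlap and appear in the same
order in both sequences, then lists the regions between them that would
still need a global alignment.

```python
def chain_hsps(hsps):
    """Best-scoring collinear, non-overlapping chain of HSPs (simple O(n^2) DP)."""
    if not hsps:
        return []
    ordered = sorted(hsps, key=lambda h: (h[0], h[2]))
    best = [h[4] for h in ordered]   # best chain score ending at each HSP
    back = [-1] * len(ordered)       # predecessor index in that chain
    for i, (q_start, _, s_start, _, score) in enumerate(ordered):
        for j in range(i):
            # j may precede i only if it ends before i starts in both sequences.
            if ordered[j][1] <= q_start and ordered[j][3] <= s_start:
                if best[j] + score > best[i]:
                    best[i] = best[j] + score
                    back[i] = j
    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = back[i]
    return chain[::-1]


def gap_regions(chain, q_len, s_len):
    """Regions (q_from, q_to, s_from, s_to) before, between, and after the HSPs."""
    regions = []
    q_pos, s_pos = 0, 0
    for q_start, q_end, s_start, s_end, _ in chain:
        if q_start > q_pos or s_start > s_pos:
            regions.append((q_pos, q_start, s_pos, s_start))
        q_pos, s_pos = q_end, s_end
    if q_pos < q_len or s_pos < s_len:
        regions.append((q_pos, q_len, s_pos, s_len))
    return regions


if __name__ == "__main__":
    hsps = [(0, 10, 0, 10, 20.0), (5, 15, 5, 15, 18.0),
            (20, 30, 22, 32, 25.0), (12, 18, 40, 46, 5.0)]
    chain = chain_hsps(hsps)
    print(chain)
    print(gap_regions(chain, q_len=35, s_len=32))
```

Each gap region is much smaller than the full sequences, so a standard
Needleman-Wunsch matrix for it should fit in memory as long as the HSPs
cover most of the alignment.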


Ryan


_______________________________________________
BBB mailing list
[email protected]
http://www.bioinformatics.org/mailman/listinfo/bbb
