I'm trying to perform a large amount of sequence alignments of long DNA
sequences, some up to 163,000+ bp in length. I was trying to use the
standard Needleman-Wunsch algorithm, but the matrix used requires a
large amount of memory...about 100 GB of memory. This obviously won't work.

How many were you trying to align? You mean 163kb or 163Mb?
I was looking for test or comparisons for some alignment code I had which indexed the target sequences, don't recall the suggestions for that discussion but I was able to do simple genomes reasonably well ( I think I used 2 strains of e coli or something about 5 megs long) on a desktop. If you can find responses to my request from a few years ago that may ( or may not ) help. I'd offer my code, and indeed I think
I have it on a website, but I stopped development and not sure
it is nearly useful as-is unless you just want coarse alignment on
two similar sequences.

Hundreds of thousands. I'm trying to eliminate duplicates or near duplicates (>90% similarity). I'm using the methodology from cd-hit-est. However I'm not successful in getting that application to run on the number of sequences I have. Right now, I'm trying to cluster the nt database, however later I would like to cluster other sequences from other sources.

Many implementations of just about anything are bad with
memory management- sometimes just blocking or sorting or
compacting the internal representation can make a big improvement.
Not sure what exists along these lines but often some simplifcations
don't change results but decrease time/memory on futile possibilities.

Agreed. However in doing the dynamic programming matrix, you still need to allocate an m x n matrix of ints. With sequences of 163,000 bp in length, you need about 100GB of RAM. Unless there is a way to using a compact representation of the DP matrix that I'm not aware of.

Are all of these nominally the same or are you trying to align
noise to noise?

Yes, they are nominally the same...they have at least 50% of the non-overlapping words of the shorter of the two sequences.

Ideally, something with the speed similar to BLAST.

I guess in an odd way my approach could get there as it essentially queries each string for "interesting" short sequences but I'd have to
check order ( howmany of these does it use etc). Last time I checked the
academic lit, IIRC this exact-string matching was an open research area maybe 
there have been advancements
in last few years that are trivial to code or exist in an academic's lab.

If there are, I haven't heard of any. My thought was to run a BLAST alignment on the two sequences using bl2seq. Then string together the non-overlapping HSPs and perform a global alignment on the regions in between the HSPs. This is easy enough, but I want to see if there is a solution already out there first.

Ryan


_______________________________________________
BBB mailing list
[email protected]
http://www.bioinformatics.org/mailman/listinfo/bbb

Reply via email to