2009/8/4 Ryan Golhar <[email protected]>:
>>> I'm trying to perform a large number of sequence alignments of long
>>> DNA sequences, some 163,000+ bp in length. I was trying to use the
>>> standard Needleman-Wunsch algorithm, but the matrix it builds requires
>>> a large amount of memory... about 100 GB. This obviously won't work.
>>
>> How many were you trying to align? And do you mean 163 kb or 163 Mb?
>> I was once looking for tests or comparisons for some alignment code I
>> had that indexed the target sequences. I don't recall the suggestions
>> from that discussion, but I was able to align simple genomes reasonably
>> well on a desktop (I think I used two strains of E. coli, each about
>> 5 Mb long). If you can find the responses to my request from a few
>> years ago, they may (or may not) help. I'd offer my code, and I think
>> I have it on a website, but I stopped development and I'm not sure it
>> is of much use as-is unless you just want a coarse alignment of two
>> similar sequences.
>
> Hundreds of thousands. I'm trying to eliminate duplicates or near
> duplicates (>90% similarity), using the methodology from cd-hit-est.
> However, I haven't been able to get that application to run on the
> number of sequences I have. Right now I'm trying to cluster the nt
> database; later I'd like to cluster sequences from other sources as
> well.
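For what it's worth, the 100 GB figure is about what the full DP matrix
costs: 163,000 x 163,000 is roughly 2.7e10 cells, i.e. ~100 GB at 4 bytes
per cell. If you only need the score (say, to get percent identity for
deduplication), two rows of the matrix suffice; Hirschberg's
divide-and-conquer then recovers the alignment itself in linear space.
A minimal sketch in Python (not anyone's posted code; the scoring values
are illustrative, and pure Python will be slow at this scale -- the point
is the memory bound, not the speed):

# Needleman-Wunsch *score* in O(min(n, m)) memory, keeping two DP rows.
# Recovering the full alignment in linear space needs Hirschberg's
# recursion on top of this score routine.
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    if len(a) < len(b):           # keep the shorter sequence as the row
        a, b = b, a
    prev = [j * gap for j in range(len(b) + 1)]   # row for a[:0]
    for i, ca in enumerate(a, 1):
        curr = [i * gap] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]

# Two 163 kb sequences: the full matrix is 163_000**2 cells (~100 GB at
# 4 bytes each); two rows hold only ~163,001 ints apiece.
print(nw_score("GATTACA", "GCATGCU"))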
First thing that came to mind when I read the above was cd-hit. What is
cd-hit-est, and why does it fail for you? I'm curious because I'm
maintaining (or was) the cd-hit website for the project on
bioinformatics.org:

http://www.bioinformatics.org/cd-hit/

I'm planning to move that over into the wiki, where it can (hopefully)
stay more up to date.

Dan.
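For context: cd-hit-est (the nucleotide variant of cd-hit) does greedy
incremental clustering. Sequences are sorted longest first; each one is
compared against the representatives of existing clusters, using a
short-word (k-mer) filter to skip most full alignments, and either joins
the first cluster whose representative it matches at the identity
threshold or founds a new cluster. A rough sketch of that idea in Python
(not cd-hit's actual code; plain k-mer overlap stands in here for the
alignment-based identity check, so the thresholds are only loosely
comparable):

def kmers(seq, k=8):
    # all k-length substrings of seq
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_cluster(seqs, threshold=0.9, k=8):
    """Longest-first greedy clustering; each sequence joins the first
    representative whose k-mer overlap reaches the threshold."""
    reps = []                  # (rep k-mer set, member list) per cluster
    for s in sorted(seqs, key=len, reverse=True):
        ks = kmers(s, k)
        for rk, members in reps:
            # fraction of this sequence's k-mers found in the representative
            if ks and len(ks & rk) / len(ks) >= threshold:
                members.append(s)
                break
        else:                  # no representative matched: new cluster
            reps.append((ks, [s]))
    return [members for _, members in reps]

print(greedy_cluster(["ACGTACGTACGT", "ACGTACGAACGT", "TTTTCCCCGGGG"]))

Memory in this sketch grows with the representatives' k-mer tables, which
may be part of why a run over the whole nt database blows up; the real
cd-hit packs its words far more compactly than Python sets do.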
