On Mon, Jul 6, 2009 at 11:35 AM, Peter Rice <[email protected]> wrote:
>
> Peter C wrote:
> > Hi Peter R. et al,
> >
> > I gather EMBOSS is looking for feedback for new applications (given
> > the recent funding from the BBSRC - congratulations again). How about
> > suggestions for extensions to existing EMBOSS applications?
> >
> > I've used bits of EMBOSS for several years now (thank you!). Something
> > I have sometimes wanted to do is a many-to-many pairwise sequence
> > alignment with the EMBOSS tools needle and water.
> >
> > Right now, needle and water take two files (here referred to as A and
> > B), file A has just one sequence, and file B can have one or more
> > sequences. I'd like to be able to supply two files both with multiple
> > entries, and have needle/water do pairwise alignments between all the
> > sequences in A against all the sequences in B. This might be useful
> > for finding reciprocal best hits in comparative genomics (as a slower
> > but exact alternative to FASTA or BLAST).
>
> The application is easy to add (after the release)
>
> The usual problem with all-against-all is that it involves loading one
> of the inputs as a sequence set entirely in memory - to avoid reading
> one input many times over.

Right - and without some specific use cases it would be difficult to
decide whether holding one input in memory or reading the file many
times is best in general. [I suppose you could do something a bit more
cunning: start by caching the sequences as you read them, ready for
re-use, but if the number of sequences crosses a threshold, stop
caching and switch to re-reading the file for subsequent loops?]

> We have an application supermatcher which does this - the first sequence
> is streamed through, the second is a sequence set loaded into memory. It
> uses word matching to find seed alignments then runs a limited alignment
> around the hits.
>
> superwater would be a possible name (or superneedle).

If you see many-to-many versions of water and needle as separate
applications, then those names sound fine.

> How popular would such a program be?

I don't know - as I said, this is more of a suggestion than a request.
I don't *need* this tool, but there have been occasions in the past
where I would have tried using it if it had existed. Perhaps others on
the list can think of better uses for this tool idea?

> How large would the smaller input set be?

Hard to say without specific examples in mind. For some hand-waving
upper limits: for comparative genomics of bacteria using protein
sequences, you might have a few thousand in each file. If I was trying
this as part of an ad-hoc clustering algorithm (all-against-all), again
maybe a few thousand sequences.

In practice, a heuristic tool like supermatcher (or FASTA or BLAST)
would probably be more sensible for datasets of that size, due to the
computational time. I see needle and water as most useful on smaller
datasets where the runtime cost of using an exact algorithm isn't too
high. Therefore many-to-many needle/water searches may be best targeted
at smaller sequence files.
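
To make the bracketed caching idea above a little more concrete, here is
a minimal Python sketch. It is only an illustration of the control flow,
not how EMBOSS does or would implement it: the function names
(read_fasta, all_against_all) and the max_cached parameter are
hypothetical, the FASTA parser is deliberately simplistic, and the
actual needle/water alignment of each pair is left out.

```python
from typing import Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(path: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA file (simplistic parser)."""
    name: Optional[str] = None
    chunks: List[str] = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def all_against_all(path_a: str, path_b: str,
                    max_cached: int = 1000) -> Iterator[Tuple[Record, Record]]:
    """Yield every (record from A, record from B) pair.

    File A is streamed once. File B is cached while it is read for the
    first time; if it holds more than max_cached records (a hypothetical
    threshold) the cache is dropped and B is re-read for every later
    record of A instead.
    """
    cache: Optional[List[Record]] = []  # None once caching is abandoned
    complete = False  # True once all of B is known to fit in the cache
    for rec_a in read_fasta(path_a):
        if complete:
            # All of B is in memory, so no further file reads are needed.
            for rec_b in cache:
                yield rec_a, rec_b
            continue
        for rec_b in read_fasta(path_b):
            if cache is not None:
                if len(cache) < max_cached:
                    cache.append(rec_b)
                else:
                    cache = None  # threshold crossed: stop caching
            yield rec_a, rec_b
        complete = cache is not None
```

Each yielded pair would then be handed to whatever does the pairwise
alignment. Both modes produce the same pairs in the same order, so the
threshold only trades memory against repeated file reads.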

Things might be different with a multicore or GPU/OpenCL version of
needle and water ;)

Anyway, unless someone else thinks a many-to-many version of needle and
water would be useful, I wouldn't expect you to implement this. I'm just
putting the idea forward for discussion.

Regards,

Peter C.
_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss
