Peter C wrote:
> [I suppose you could do something a bit more cunning like start by
> caching the sequences as you read them for re-use, but if the
> number of sequences crosses a threshold, stop caching and switch
> to re-reading the file for subsequent loops?]
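Something like this, presumably - a minimal sketch in Python rather
than EMBOSS C, with the class name, parser hook and threshold value
all invented for illustration:

CACHE_THRESHOLD = 10000   # illustrative cut-off, not a tuned value

class SequenceSource:
    """Yield records on each call to records(). The first pass
    caches them for re-use, but if the count crosses the threshold
    the cache is dropped and later passes re-read the file."""

    def __init__(self, filename, parse):
        self.filename = filename
        self.parse = parse    # e.g. a FASTA parser yielding records
        self.cache = None     # filled in after a successful first pass
        self.too_big = False

    def records(self):
        if self.cache is not None:
            # Small file: serve the cached records.
            yield from self.cache
            return
        if self.too_big:
            # Large file: re-read it on every pass.
            with open(self.filename) as handle:
                yield from self.parse(handle)
            return
        # First pass: cache while reading, unless we cross the threshold.
        cache = []
        with open(self.filename) as handle:
            for record in self.parse(handle):
                if not self.too_big:
                    cache.append(record)
                    if len(cache) > CACHE_THRESHOLD:
                        cache = []        # release the memory
                        self.too_big = True
                yield record
        if not self.too_big:
            self.cache = cache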
Tricky. Re-reading is not always possible - for example, when streamed
standard input is the data source.

> Perhaps others on the list can think of better uses for this tool idea?

Let's see what response we get. One never knows until the question
is asked :-)

>> How large would the smaller input set be?
>
> Hard to say without specific examples in mind. For some hand-waving
> upper limits, for comparative genomics of bacteria using protein
> sequences, you might have a few thousand in each file. If I was trying
> this as part of an ad-hoc clustering algorithm (all-against-all), again
> maybe a few thousand sequences. In practice, a heuristic tool like
> supermatcher (or FASTA or BLAST) would probably be more sensible
> for large datasets like this due to the computational time.
>
> I see needle and water as most useful on smaller datasets where
> the runtime cost of using an exact algorithm isn't too high. Therefore
> many-to-many needle/water searches may be best targeted at
> smaller sequence files. Things might be different with a multicore
> or GPU/OpenCL version of needle and water ;)

Multicore would be a possibility - at least on systems configured for
it. We are looking into picking up methods from the BioManyCores
project.

> Anyway, unless someone else thinks a many-to-many version
> of needle and water would be useful, I wouldn't expect you to
> implement this. I'm just putting the idea forward for discussion.

Implementing it is easy - we could simply send you the code to install
locally if nobody else needs it :-) After all, it is only a minor
modification to the existing applications.

regards,

Peter
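P.S. For anyone who wants to experiment in the meantime, the
many-to-many loop is easy to script outside EMBOSS. Here is a minimal
sketch using Biopython's pairwise aligner in place of needle - the
file names are placeholders, and the gap and matrix settings are meant
to be needle-like rather than guaranteed identical to needle's
defaults. A multiprocessing pool also gives a crude multicore version
for free:

from itertools import product
from multiprocessing import Pool

from Bio import SeqIO, Align
from Bio.Align import substitution_matrices

# Needle-like global protein alignment settings (illustrative only -
# check them against needle's actual defaults before relying on them).
aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5

def score_pair(pair):
    a, b = pair
    return a.id, b.id, aligner.score(str(a.seq), str(b.seq))

if __name__ == "__main__":
    # Small files only - this deliberately caches both sets in memory.
    set_a = list(SeqIO.parse("a.fasta", "fasta"))
    set_b = list(SeqIO.parse("b.fasta", "fasta"))
    with Pool() as pool:
        for id_a, id_b, score in pool.imap(score_pair,
                                           product(set_a, set_b)):
            print("%s\t%s\t%.1f" % (id_a, id_b, score))

With a few thousand sequences in each file this will still take a
while, which is where the heuristic tools mentioned above come in.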
_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss