On Sun, 06 Feb 2011 11:49:56 -0500 Charlie <[email protected]> wrote:
C> Given how you frame the problem, then the hash lookup isn't even an
C> option! No question, 6000+ string searches will be slow vs. a trie.
C> Given the varying requirements we all encounter, day-to-day, I think
C> this is an interesting exercise. Thanks for sharing these modules,
C> Ted.

Sure. I think this is a fascinating area. I was looking into it just
recently because a biologist asked me about microarray analysis. There,
they have tens of thousands of expressed proteins, each with a score
(vs. a control), and they try to find the most strongly *correlated*
expressions of certain proteins, which are basically substrings of a
big text body (there's a lot more to it, of course, including that some
proteins are known to be grouped and some have known functions). I
found http://www.bioconductor.org/ which uses the R statistical
language to qualify these, but I was investigating Perl approaches to
the same problem.

C> The OP indicated that the text can be tokenized:

KS> Unfortunately, my names can be embedded in larger "words" of the
KS> input text, as long as they are delimited by certain punctuation.

On Sun, 06 Feb 2011 10:25:43 -0500 [email protected] wrote:

b> Actually, I believe the OP said that there were still delimiters
b> required; they just weren't \s, so one CAN still tokenize.

I didn't parse that the same way, sorry. Definitely, if the input can
be tokenized, you'd have a good shot at the split+lookup approach.

Ted
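P.S. Here's a minimal sketch of the split+lookup idea, assuming the
delimiting punctuation is known up front. The names, the sample text,
and the character class below are made-up stand-ins, not the OP's
actual data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Stand-in for the real 6000+ names.
    my %names = map { $_ => 1 } qw(foo bar baz);

    # Stand-in input; the names are embedded in larger "words".
    my $text = 'xx/foo-stuff_with.bar:and;baz,more';

    # Guessed delimiter class -- substitute the OP's actual punctuation.
    my %seen;
    for my $token (split /[\s\/\-_.:;,]+/, $text) {
        $seen{$token}++ if exists $names{$token};
    }

    printf "%s: %d\n", $_, $seen{$_} for sort keys %seen;

And if the input really couldn't be tokenized, one fallback would be a
single alternation regex over all the names: perl 5.10 and later
compile an alternation of literal strings into a trie internally, so
the scan is one pass over the text rather than 6000+ separate searches.
Same stand-in data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @names = qw(foo bar baz);    # stand-in for the real list
    my $text  = 'xx/foo-stuff_with.bar:and;baz,more';

    # Longest names first so the alternation prefers the longest match.
    my $alt = join '|', map quotemeta,
              sort { length $b <=> length $a } @names;

    while ($text =~ /($alt)/g) {
        printf "matched %s ending at offset %d\n", $1, pos $text;
    }

(You'd still want to anchor the matches to the OP's delimiters, e.g.
with lookaround assertions, before trusting the hits, since the bare
alternation will match a name embedded anywhere.)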

