On Sun, 06 Feb 2011 11:49:56 -0500 Charlie <[email protected]> wrote: 

C> Given how you frame the problem, then the hash lookup isn't even an
C> option!  No question, 6000+ string searches will be slow vs. a trie.
C> Given the varying requirements we all encounter, day-to-day, I think
C> this is an interesting exercise.  Thanks for sharing these modules,
C> Ted.

Sure.  I think this is a fascinating area.  I was looking into this just
recently because a biologist asked me about microarray analysis.  There,
they have tens of thousands of expressed proteins with a score (vs. a
control) and they try to find the strongest *correlated* expressions of
certain proteins, which are basically substrings of a big text body
(there's a lot more to it, of course, including that some proteins are
known to be grouped and some have known functions).  I found
http://www.bioconductor.org/ which uses the R statistical language to
qualify these, but was investigating Perl approaches to the same.

C> The OP indicated that the text can be tokenized:
KS> Unfortunately, my names can be embedded in larger "words" of the input
KS> text, as long as they are delimited by certain punctuation.

On Sun, 06 Feb 2011 10:25:43 -0500 [email protected] wrote: 

b> Actually I believe the OP said that there were still delimiters required,
b> they just weren't \s so one CAN still tokenize

I didn't parse that the same way, sorry.  Definitely, if the input can
be tokenized you'd have a good shot at the split+lookup approach.
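To make the split+lookup idea concrete, here's a minimal sketch (in Python for brevity, though the thread is about Perl; the names, delimiter class, and sample text are made up for illustration): split the input on the delimiter punctuation, then test each token against a hash set of names.

```python
import re

# Hypothetical name set, standing in for the OP's 6000+ names.
names = {"BRCA1", "TP53", "EGFR"}

# Assumed delimiter class -- "certain punctuation" per the OP, plus whitespace.
delimiters = r"[;,.\s|/]+"

text = "sample|BRCA1;controlTP53 EGFR,done"

# split+lookup: tokenize on the delimiters, then a constant-time
# set-membership test per token instead of scanning for every name.
found = [tok for tok in re.split(delimiters, text) if tok in names]
```

Note that "controlTP53" is not matched: with no delimiter between "control" and "TP53" they form a single token, which is exactly why this approach only works if the input really can be tokenized.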

Ted

_______________________________________________
Boston-pm mailing list
[email protected]
http://mail.pm.org/mailman/listinfo/boston-pm
