Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread belg4mit
You assumed that \s will delimit the tokens. That's not the case (see the original message, the interesting data can occur anywhere). So you can't tokenize and do a simple hash lookup. If you benchmark 6000 Acutally I believe the OP said that there were still delimters required, they just

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Charlie
Given how you frame the problem, then the hash lookup isn't even an option! No question, 6000+ string searches will be slow vs. a trie. Given the varying requirements we all encounter, day-to-day, I think this is an interesting exercise. Thanks for sharing these modules, Ted. The OP

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Ted Zlatanov
On Sun, 06 Feb 2011 11:49:56 -0500 Charlie creit...@rcn.com wrote: C Given how you frame the problem, then the hash lookup isn't even an C option! No question, 6000+ string searches will be slow vs. a trie. C Given the varying requirements we all encounter, day-to-day, I think C this is an

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread belg4mit
Too bad Text::Match::FastAlternatives's return values aren't more useful i.e; the matched position. This with a /g equivalent and Bob's your uncle. ___ Boston-pm mailing list Boston-pm@mail.pm.org http://mail.pm.org/mailman/listinfo/boston-pm

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-06 Thread Greg London
is faster. but but not tokenizing gives your grammers more flexibility I think. -Original message- From: Charlie creit...@rcn.com To: Ted Zlatanov t...@lifelogs.com Cc: boston...@pm.org Sent: Sun, Feb 6, 2011 16:49:56 GMT+00:00 Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Martyn Peck
hi Ok, I've been reading over the responses you've been getting and I just have to ask everyone. What's wrong with something like this: while($line=){ foreach my $name (@names){ $line ~= s/$name/prefix_$1/g; } } I know it seems kind of

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Uri Guttman
MP == Martyn Peck m...@mwpnet.com writes: MP What's wrong with something like this: MP while($line=){ MP foreach my $name (@names){ MP $line ~= s/$name/prefix_$1/g; MP } MP } it is O( N^2 ) which is very slow for large data sets. MP I know it seems

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Alex Vandiver
At Fri Feb 04 18:53:09 -0500 2011, Uri Guttman wrote: that will kill your cpu. alternations are very slow since they have to go back and try from the beginning of the list each time. Since we're talking about literals, this hasn't been true since 2007, with the release of perl 5.10. Perl now

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-05 Thread Charlie
Short answer, no, Perl regex will not build an optimal lookup of a token into your set of 6000 names. In general, if speed is the issue, do not use regex. It does not scale. Also, be clear on the 2 problems at hand: 1) tokenizing 1GB of input text and 2) adding a prefix to identified

[Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Kripa Sundar
Hi folks, Problem: I have a 900 Meg text file, containing random text. I also have a list of 6000 names (alphanumeric strings) that occur in the random text. I need to tag a prefix on to each occurrence of each of these 6000 names. My premise: I believe a regex would give the simplest and most

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Uri Guttman
KS == Kripa Sundar kripa.sun...@synopsys.com writes: KS I have a 900 Meg text file, containing random text. I also have a list KS of 6000 names (alphanumeric strings) that occur in the random text. KS I need to tag a prefix on to each occurrence of each of these 6000 KS names. KS My

Re: [Boston.pm] Q: giant-but-simple regex efficiency

2011-02-04 Thread Greg London
To: boston-pm@mail.pm.org boston-pm@mail.pm.org Sent: Sat, Feb 5, 2011 00:53:35 GMT+00:00 Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency Thanks for the prompt replies, folks! Unfortunately, my names can be embedded in larger words of the input text, as long as they are delimited by certain