You assumed that \s will delimit the tokens. That's not the case (see
the original message; the interesting data can occur anywhere). So you
can't tokenize and do a simple hash lookup. If you benchmark 6000
Actually I believe the OP said that there were still delimiters required,
they just
Given how you frame the problem, then the hash lookup isn't even an
option! No question, 6000+ string searches will be slow vs. a
trie. Given the varying requirements we all encounter, day-to-day, I
think this is an interesting exercise. Thanks for sharing these modules, Ted.
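The per-name-scan cost Charlie is describing can be sketched in a few lines of pure Perl. The three-name list and sample text below are made-up stand-ins for the OP's 6000 names and 900 Meg file:

```perl
use strict;
use warnings;

my @names = qw(alice bob carol);   # stand-in for the 6000-name list
my $text  = "bob met alice near the bridge";

# Naive approach: one full scan of the text per name -- in the real
# problem that is 6000 passes over 900 Meg of input.
my $naive_hits = 0;
for my $name (@names) {
    my $pos = 0;
    while (($pos = index($text, $name, $pos)) >= 0) {
        $naive_hits++;
        $pos += length $name;
    }
}

# Trie-style approach: perl 5.10+ compiles a literal alternation into
# a trie internally, so all names are matched in one pass over the text.
my $alt       = join '|', map quotemeta, @names;
my @trie_hits = $text =~ /($alt)/g;

printf "naive: %d hits, trie: %d hits\n", $naive_hits, scalar @trie_hits;
```

Both find the same two hits here; the difference is that the first makes one pass per name while the second makes one pass total.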
The OP
On Sun, 06 Feb 2011 11:49:56 -0500 Charlie creit...@rcn.com wrote:
C Given how you frame the problem, then the hash lookup isn't even an
C option! No question, 6000+ string searches will be slow vs. a trie.
C Given the varying requirements we all encounter, day-to-day, I think
C this is an
Too bad Text::Match::FastAlternatives's return values aren't more useful,
i.e., the matched position. That plus a /g equivalent and Bob's your uncle.
___
Boston-pm mailing list
Boston-pm@mail.pm.org
http://mail.pm.org/mailman/listinfo/boston-pm
is faster. But not tokenizing gives
your grammars more flexibility, I think.
-----Original Message-----
From: Charlie creit...@rcn.com
To: Ted Zlatanov t...@lifelogs.com
Cc: boston...@pm.org
Sent: Sun, Feb 6, 2011 16:49:56 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
hi
Ok, I've been reading over the responses you've been getting and I just
have to ask everyone.
What's wrong with something like this:
while (my $line = <>) {
    foreach my $name (@names) {
        $line =~ s/($name)/prefix_$1/g;
    }
}
I know it seems kind of
MP == Martyn Peck m...@mwpnet.com writes:
MP What's wrong with something like this:
MP while (my $line = <>) {
MP     foreach my $name (@names) {
MP         $line =~ s/($name)/prefix_$1/g;
MP     }
MP }
It is O(N*M), one full pass over the M bytes of text for each of the N names, which is very slow for large data sets.
MP I know it seems
At Fri Feb 04 18:53:09 -0500 2011, Uri Guttman wrote:
that will kill your cpu. alternations are very slow since they have to
go back and try from the beginning of the list each time.
Since we're talking about literals, this hasn't been true since 2007,
with the release of perl 5.10. Perl now
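For illustration, here is a minimal version of that trie-optimized alternation applied to the OP's prefixing task. The name list and input line are made up; the real list has ~6000 entries:

```perl
use strict;
use warnings;

# Stand-in for the OP's 6000-name list.
my @names = qw(alice bob carol);

# One big literal alternation; perl 5.10+ compiles this into a trie,
# so the whole list is matched in a single pass over each line.
# Sorting longest-first keeps a longer name from being shadowed by a
# shorter name that is a prefix of it.
my $alt = join '|', map quotemeta, sort { length $b <=> length $a } @names;
my $re  = qr/\b($alt)\b/;

my $line = "bob met alice near carol";
$line =~ s/$re/prefix_$1/g;
print "$line\n";   # prefix_bob met prefix_alice near prefix_carol
```

The \b anchors assume the names only need word-boundary delimiting, which may not match the OP's actual delimiter set.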
Short answer, no, Perl regex will not build an optimal lookup of a token
into your set of 6000 names. In general, if speed is the issue, do not use
regex. It does not scale.
Also, be clear on the 2 problems at hand: 1) tokenizing 1GB of input text
and 2) adding a prefix to identified
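Assuming the delimiters are simple non-alphanumeric runs (an assumption; the OP's actual delimiter set isn't spelled out in this thread), the tokenize-and-hash-lookup approach for both sub-problems looks like:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 6000-name list.
my %is_name = map { $_ => 1 } qw(alice bob carol);

my $text = "bob met alice; carol was late";

# Split on non-alphanumeric runs, capturing the delimiters so the
# text can be reassembled unchanged. Each token costs one O(1) hash
# lookup, so the whole pass is linear in the size of the input.
my $out = join '', map { $is_name{$_} ? "prefix_$_" : $_ }
                   split /([^A-Za-z0-9]+)/, $text;
print "$out\n";   # prefix_bob met prefix_alice; prefix_carol was late
```

The capturing parentheses in split are what preserve the delimiters; without them the text could not be rebuilt byte-for-byte.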
Hi folks,
Problem:
I have a 900 Meg text file, containing random text. I also have a list
of 6000 names (alphanumeric strings) that occur in the random text.
I need to tag a prefix on to each occurrence of each of these 6000
names.
My premise:
I believe a regex would give the simplest and most
KS == Kripa Sundar kripa.sun...@synopsys.com writes:
KS I have a 900 Meg text file, containing random text. I also have a list
KS of 6000 names (alphanumeric strings) that occur in the random text.
KS I need to tag a prefix on to each occurrence of each of these 6000
KS names.
KS My
To: boston-pm@mail.pm.org boston-pm@mail.pm.org
Sent: Sat, Feb 5, 2011 00:53:35 GMT+00:00
Subject: Re: [Boston.pm] Q: giant-but-simple regex efficiency
Thanks for the prompt replies, folks!
Unfortunately, my names can be embedded in larger words of the input
text, as long as they are delimited by certain