Updated script at bottom.

On 2/23/06, Uri Guttman <[EMAIL PROTECTED]> wrote:
> AY> $text =~ s{(
> AY>       (\b\w+(?:['-]+\w+)*\b)
>
> why the multiple ['-] inside the words? could those chars ever begin or
> end words? so just [\w'-]+ should be fine there.

It's possible to have multi-hyphenated words. I didn't think it was worth
the time to figure out how to handle that and single-apostrophe words at
the same time. Besides, I'm not verifying the accuracy of the text. In the
spirit of testing though, I changed it to (\b[\w'-]*\b) and it took 40
seconds and found 's and ' as words where the original did not.

> AY>       (??{!$unique{$^N}++?"(?=)":"(?!)"})
>
> i am not sure why you do that boolean trick there. i have seen it before
> (and actually use it somewhere but what is its purpose here?

Well, as we were looking at it, we realized it wasn't really necessary for
the word parsing. What it was originally doing, however, was finding the
unique occurrences in a string of text. Basically, if the match was not in
the hash then (?=) would force the regex to succeed, otherwise (?!) would
force it to fail.

This is the way I understand it: (??{<code>}) replaces the regex at the
current pos() with the result of the <code> block. If the match ($^N) was
not in the hash, then it would auto-vivify the key, increment it, and
return (?=), which is a positive lookahead on nothing, which always
succeeds, so we continue on. If the match ($^N) is in the hash, then it
increments the value and returns (?!), which is a negative lookahead on
nothing, which always fails, so we force it to backtrack and try again.
I'm still wrapping my brain around this concept so I may have it twisted a
little.

Changing the regex to

    1 while $text =~ m{(
                    (\b\w+(?:['-]+\w+)*\b)
                    (?{!$unique{$^N}++})
                  )
                }xg;

dropped the time down to 3s.

> since you just replace the word by itself, why use s///? m// will get
> the same results and should be much faster.

There was no appreciable difference between the two types of regexes (see
my code below).

> AY> print "$_ => $unique{$_}\n" for sort keys %unique;
>
> if you want raw speed, that makes lots of calls to print which is very
> slow as it needs to invoke stdio code for each call.
> this should be faster (even with the ram usage):
>
>     print map "$_ => $unique{$_}\n", sort keys %unique;

Didn't seem to make a difference, but I like this way better. Seems more
perlish.

Before changing the regex as indicated above (where I explained that we
didn't really need to do it that way :/), and with your other changes, the
speed was still right around 7s (using time ./simple.pl). However, memory
usage was noticeably (if not significantly) improved.

#!/usr/bin/perl -w
use strict;
use File::Slurp;

my $text = read_file( './kjv10.txt' );
my %unique;

# Flip the 0 to 1 to time the s/// version instead of the m// loop.
if ( 0 ) {
    print "substitution\n";
    # Original version, using (??{}) to force success or failure:
    # $text =~ s{(
    #         (\b\w+(?:['-]+\w+)*\b)
    #         (??{!$unique{$^N}++?"(?=)":"(?!)"})
    #         )
    #     }{}xg;
    $text =~ s{(
            (\b\w+(?:['-]+\w+)*\b)
            (?{$unique{$^N}++})
            )
        }{}xg;
}
else {
    print "while loop\n";
    # Original version, using (??{}) to force success or failure:
    # 1 while $text =~ m{(
    #         (\b\w+(?:['-]+\w+)*\b)
    #         (??{!$unique{$^N}++?"(?=)":"(?!)"})
    #         )
    #     }xg;
    1 while $text =~ m{(
            (\b\w+(?:['-]+\w+)*\b)
            (?{!$unique{$^N}++})
            )
        }xg;
}

# One print call for the whole report instead of one per key.
print map "$_ => $unique{$_}\n", sort keys %unique;

--
Alan
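
P.S. For anyone following along, here is a minimal sketch of how I
understand the difference between (??{}) and (?{}). It runs on a made-up
one-line string rather than kjv10.txt, and uses a simplified \b\w+\b
pattern that ignores apostrophes and hyphens. The names %seen, %count,
@first_time and @every_word are only for illustration and are not in the
script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample text (an assumption, stands in for the real input file).
my $text = "the cat and the dog and the bird";

# (??{ ... }): the block's return value is used as a pattern at this
# point. "(?=)" always succeeds and "(?!)" always fails, so a word that
# has been seen before makes that match attempt fail and the engine
# moves on to look for the next candidate.
my %seen;
my @first_time;
while ( $text =~ m{ (\b\w+\b) (??{ !$seen{$^N}++ ? "(?=)" : "(?!)" }) }xg ) {
    push @first_time, $1;
}

# (?{ ... }): the block runs only for its side effect, so every word
# matches and is simply counted.
my %count;
my @every_word;
while ( $text =~ m{ (\b\w+\b) (?{ $count{$^N}++ }) }xg ) {
    push @every_word, $1;
}

print "first occurrences: @first_time\n";
print "all words:         @every_word\n";
print map "$_ => $count{$_}\n", sort keys %count;
```

In the first loop only the first occurrence of each word comes back as a
match, while the second loop sees every word. Since all I wanted was the
counts, the plain (?{}) form is enough, and it avoids building and
matching a new sub-pattern for every word.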