Re: Interesting little regex
> Yes, what are the unique occurrences of text in that string? I've run the code and I'm still not exactly sure what it's supposed to do.

use Data::Dump qw/ dump /;
$a="abcde"x4;
$a=~s{((\w+?)(??{!$b{$^N}++?"(?=)":"(?!)"}))}{($1)}xg;
print "$a\n";
print dump(\%b), "\n";

(a)(b)(c)(d)(e)(ab)(cd)(ea)(bc)(de)(abc)de
{ a => 3, ab => 2, abc => 1, b => 2, bc => 1, c => 2, cd => 1, d => 3, de => 2, e => 3, ea => 1 }

There is 1 unique occurrence of 'abc', 2 occurrences of 'ab' (not contained in the other occurrence of 'abc'), and 3 occurrences of 'a' (not contained in the other occurrences of 'abc' and 'ab').

If you have a stream of text of variable size, with delimiters of varying length embedded in the string and values of varying length:

delim1abcdefgdel2hijklmnopqrstd3uvwxyz

(where we have delim1, del2 and d3 as delimiters, and abcdefg, hijklmnopqrst and uvwxyz as values), how can we get those out efficiently and without a lot of programming?

That regex does that. The !$b{$^N}++ portion of the regex is replaced in live code with a subroutine call that returns true or false based on whether we have a recognized delimiter yet, and the regex is slightly different so that we can capture the value as well.

I just thought it was a neat little regex that qualified as a FWP, and passed it along to my perl monger and user groups as well.

--
Alan
Re: Interesting little regex
On Fri, Feb 24, 2006 at 01:40:15PM -0700, Alan Young wrote:
> > Yes, what are the unique occurrences of text in that string? I've run the code and I'm still not exactly sure what it's supposed to do.
>
> use Data::Dump qw/ dump /;
> $a="abcde"x4;
> $a=~s{((\w+?)(??{!$b{$^N}++?"(?=)":"(?!)"}))}{($1)}xg;
> print "$a\n";
> print dump(\%b), "\n";
>
> (a)(b)(c)(d)(e)(ab)(cd)(ea)(bc)(de)(abc)de
> { a => 3, ab => 2, abc => 1, b => 2, bc => 1, c => 2, cd => 1, d => 3, de => 2, e => 3, ea => 1 }
>
> There is 1 unique occurrence of 'abc', 2 occurrences of 'ab' (not contained in the other occurrence of 'abc'), and 3 occurrences of 'a' (not contained in the other occurrences of 'abc' and 'ab').

I'm afraid I'm not getting what you mean by unique occurrence... Why is there only one unique occurrence of 'abc', when the string contains 'abc' four times? Why are there two unique occurrences of 'de', but only one of 'bc'? Why are there no unique occurrences at all of 'abcd'?

> I just thought it was a neat little regex that qualified as a FWP, and passed it along to my perl monger and user groups as well.

It is a neat regex, that's why I want to understand what it's for. :)

Ronald
Re: Interesting little regex
> I'm afraid I'm not getting what you mean by unique occurrence... Why is there only one unique occurrence of 'abc', when the string contains 'abc' four times? Why are there two unique occurrences of 'de', but only one of 'bc'? Why are there no unique occurrences at all of 'abcd'?

I'm probably not stating myself well (I'm known for that). Maybe unique occurrence isn't what I'm really trying to say.

Suppose we have a stream of text (say, a file that is several tens of millions of bytes in size) and we're limited in how much we can load into memory at a time, or we're receiving it over a connection of some kind (e.g., serial or TCP). The stream contains a varying number of delimiters of varying length (delimiter1, delim2, del3).

The value of the delimiter is the delimiter and an unspecified number of bytes, up to the next known delimiter. (The value of delimiter 'del2' in the string 'del1abcdel2def' is 'del2def'.)

I don't understand exactly why this format was decided upon; this was the poser handed to my co-worker, and this is what he came up with as a proof of concept.

Of course, this requires that no delimiter can be a substring of another.

Better?

--
Alan
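The record format described above can be tried out with a small sketch. This is not the (??{...}) approach from the thread: it swaps in a plain alternation built from a list of known delimiters. The split_stream helper is hypothetical, and the sketch assumes the whole string is already in memory and that no delimiter is a substring of another.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into records, where each record is a
# known delimiter plus everything up to the next known delimiter.
sub split_stream {
    my ($stream, @delims) = @_;

    # Longest first, so the alternation never stops early on a shorter name.
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a } @delims;

    my @records;
    # Lazily extend each record until the next delimiter or end of string.
    while ( $stream =~ /((?:$alt).*?)(?=$alt|\z)/gs ) {
        push @records, $1;
    }
    return @records;
}

my @records = split_stream( 'del1abcdel2def', qw(del1 del2) );
print "$_\n" for @records;    # prints del1abc, then del2def
```

Any text before the first known delimiter is skipped, which may or may not be what a real stream parser wants.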
Re: Interesting little regex
On Fri, Feb 24, 2006 at 03:20:27PM -0700, Alan Young wrote:
> > I'm afraid I'm not getting what you mean by unique occurrence... Why is there only one unique occurrence of 'abc', when the string contains 'abc' four times? Why are there two unique occurrences of 'de', but only one of 'bc'? Why are there no unique occurrences at all of 'abcd'?
>
> I'm probably not stating myself well (I'm known for that). Maybe unique occurrence isn't what I'm really trying to say.
>
> Suppose we have a stream of text (say, a file that is several tens of millions of bytes in size) and we're limited in how much we can load into memory at a time, or we're receiving it over a connection of some kind (e.g., serial or TCP). The stream contains a varying number of delimiters of varying length (delimiter1, delim2, del3).
>
> The value of the delimiter is the delimiter and an unspecified number of bytes, up to the next known delimiter. (The value of delimiter 'del2' in the string 'del1abcdel2def' is 'del2def'.)
>
> I don't understand exactly why this format was decided upon; this was the poser handed to my co-worker, and this is what he came up with as a proof of concept.
>
> Of course, this requires that no delimiter can be a substring of another.
>
> Better?

It's making more sense now. Thank you for taking the time to explain it to me!

Ronald
Re: Interesting little regex
>>>>> "AY" == Alan Young [EMAIL PROTECTED] writes:

  AY> I know, replying to myself.
  AY> Parsing the KJV Bible took about 7 seconds with this:

  AY> #!/usr/bin/perl -w
  AY> use strict;

  AY> my $text = do {
  AY>   open my $T, './kjv10.txt' or die "Couldn't open kjv10.txt: $!\n";
  AY>   local $/;
  AY>   <$T>;
  AY> };

use File::Slurp ;
my $text = read_file( 'bibble' ) ;

much faster that way.

  AY> my %unique;

  AY> $text =~ s{(
  AY>             (\b\w+(?:['-]+\w+)*\b)

why the multiple ['-] inside the words? could those chars ever begin or end words? so just [\w'-]+ should be fine there.

  AY>             (??{!$unique{$^N}++?"(?=)":"(?!)"})

i am not sure why you do that boolean trick there. i have seen it before (and actually use it somewhere) but what is its purpose here?

  AY>            )
  AY>           }{
  AY>            $1

since you just replace the word by itself, why use s///? m// will get the same results and should be much faster.

  AY>           }xg;

  AY> print "$_ = $unique{$_}\n" for sort keys %unique;

if you want raw speed, that makes lots of calls to print, which is very slow as it needs to invoke stdio code for each call. this should be faster (even with the ram usage):

print map "$_ = $unique{$_}\n", sort keys %unique;

i am curious how much faster it will run with all those changes. :)

uri

--
Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs http://jobs.perl.org
Re: Interesting little regex
An apostrophe (') can end words in English; the most obvious case is possessive plurals, though it can also end some contractions.

--
H4sICNoBwDoAA3NpZwA9jbsNwDAIRHumuC4NklvXTOD0KSJEnwU8fHz4Q8M9i3sGzkS7BBrm
OkCTwsycb4S3DloZuMIYeXpLFqw5LaMhXC2ymhreVXNWMw9YGuAYdfmAbwomoPSyFJuFn2x8
Opr8bBBidcc=
--
MOTD on Prickle-Prickle, the 54th of Chaos, in the YOLD 3172:
The man who makes no mistakes does not usually make anything.
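To make the trailing-apostrophe point concrete, here is a small sketch comparing the two word patterns discussed in this thread on a possessive plural. The sample sentence is made up for illustration.

```perl
use strict;
use warnings;

my $text = "the disciples' nets were well-known";

# Pattern from the original script: apostrophes and hyphens only count
# when they sit between word characters.
my @inner_only = $text =~ /(\b\w+(?:['-]+\w+)*\b)/g;

# Suggested simplification: apostrophes and hyphens count anywhere.
my @anywhere = $text =~ /([\w'-]+)/g;

print join('|', @inner_only), "\n";   # the|disciples|nets|were|well-known
print join('|', @anywhere),   "\n";   # the|disciples'|nets|were|well-known
```

Neither pattern can tell a possessive apostrophe from a closing single quote, so which behaviour is preferable depends on the text being counted.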
Re: Interesting little regex
On Thu, 23 Feb 2006 13:02:32 -0500, Uri Guttman wrote:

>  AY> $text =~ s{(
>  AY>             (\b\w+(?:['-]+\w+)*\b)
>
> why the multiple ['-] inside the words? could those chars ever begin or end words? so just [\w'-]+ should be fine there.

That reminds me, only earlier today I looked at the word frequency counter code in perlfaq6.

http://perldoc.perl.org/perlfaq6.html#How-can-I-print-out-a-word-frequency-or-line-frequency-summary%3f

I'm a bit puzzled by the comment:

    while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
        $seen{$1}++;
    }

I'm wondering why they do it this way...

--
Bart.
Re: Interesting little regex
Updated script at bottom.

On 2/23/06, Uri Guttman [EMAIL PROTECTED] wrote:
> AY> $text =~ s{(
> AY>             (\b\w+(?:['-]+\w+)*\b)
>
> why the multiple ['-] inside the words? could those chars ever begin or end words? so just [\w'-]+ should be fine there.

It's possible to have multi-hyphenated words. I didn't think it was worth the time to figure out how to handle that and single apostrophe words at the same time. Besides, I'm not verifying the accuracy of the text.

In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it took 40 seconds and found 's and ' as words where the original did not.

> AY>             (??{!$unique{$^N}++?"(?=)":"(?!)"})
>
> i am not sure why you do that boolean trick there. i have seen it before (and actually use it somewhere) but what is its purpose here?

Well, as we were looking at it, we realized it wasn't really necessary for the word parsing. What it was originally doing, however, was finding the unique occurrences in a string of text. Basically, if the match was not in the hash then (?=) would force the regex to succeed, otherwise it would force it to fail.

This is the way I understand it:

(??{code}) replaces the regex at the current pos() with the result of the code block.

If the match ($^N) was not in the hash, then it would auto-vivify the key, increment it, and return (?=), which is a positive lookahead on nothing, which always succeeds, so we continue on.

If the match ($^N) is in the hash, then it increments the value and returns (?!), which is a negative lookahead on nothing, which always fails, so we force it to backtrack and try again.

I'm still wrapping my brain around this concept so I may have it twisted a little.

Changing the regex to

1 while $text =~ m{(
                    (\b\w+(?:['-]+\w+)*\b)
                    (?{!$unique{$^N}++})
                   )
                  }xg;

dropped the time down to 3s.

> since you just replace the word by itself, why use s///? m// will get the same results and should be much faster.

There was no appreciable difference between the two types of regexes (see my code below).

> AY> print "$_ = $unique{$_}\n" for sort keys %unique;
>
> if you want raw speed, that makes lots of calls to print, which is very slow as it needs to invoke stdio code for each call. this should be faster (even with the ram usage):
>
> print map "$_ = $unique{$_}\n", sort keys %unique;

Didn't seem to make a difference, but I like this way better. Seems more perlish.

Before changing the regex as described above (where I explained that we didn't really need to do it that way :/), and with your other changes applied, the speed was still right around 7s (using time ./simple.pl). However, memory usage was noticeably (if not significantly) improved.

#!/usr/bin/perl -w
use strict;

use File::Slurp;

my $text = read_file( './kjv10.txt' );

my %unique;

if ( 0 ) {
  print "substitution\n";
#  $text =~ s{(
#              (\b\w+(?:['-]+\w+)*\b)
#              (??{!$unique{$^N}++?"(?=)":"(?!)"})
#             )
#            }{}xg;
  $text =~ s{(
              (\b\w+(?:['-]+\w+)*\b)
              (?{$unique{$^N}++})
             )
            }{}xg;
} else {
  print "while loop\n";
#  1 while $text =~ m{(
#                      (\b\w+(?:['-]+\w+)*\b)
#                      (??{!$unique{$^N}++?"(?=)":"(?!)"})
#                     )
#                    }xg;
  1 while $text =~ m{(
                      (\b\w+(?:['-]+\w+)*\b)
                      (?{!$unique{$^N}++})
                     )
                    }xg;
}

print map "$_ = $unique{$_}\n", sort keys %unique;

--
Alan
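As a way to watch the (??{...}) trick at work, here is a minimal sketch that reruns the original one-liner on 'abcde' x 4 with lexical variables. The names %seen and $s are illustrative, and the sketch assumes a reasonably modern perl.

```perl
use strict;
use warnings;

my %seen;
my $s = 'abcde' x 4;

# (\w+?) grows one character at a time. The postponed subexpression
# (??{...}) returns "(?=)" (always matches) the first time a candidate is
# seen and "(?!)" (never matches) afterwards, which forces the lazy
# quantifier to extend the candidate by one more character.
$s =~ s{ ( (\w+?) (??{ !$seen{$^N}++ ? '(?=)' : '(?!)' }) ) }{($1)}xg;

print "$s\n";    # (a)(b)(c)(d)(e)(ab)(cd)(ea)(bc)(de)(abc)de
print "$_ = $seen{$_}\n" for sort keys %seen;
```

Note that the hash counts every candidate the regex tried, including the rejected ones, which is why 'a' ends up at 3 even though only one (a) appears in the output. The trailing 'de' stays unwrapped because both 'd' and 'de' had already been seen and the string ends before a longer candidate can form.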
Re: Interesting little regex
>>>>> "AY" == Alan Young [EMAIL PROTECTED] writes:

  AY> Updated script at bottom.
  AY> On 2/23/06, Uri Guttman [EMAIL PROTECTED] wrote:
  >> AY> $text =~ s{(
  >> AY>             (\b\w+(?:['-]+\w+)*\b)
  >>
  >> why the multiple ['-] inside the words? could those chars ever begin or end words? so just [\w'-]+ should be fine there.

  AY> It's possible to have multi-hyphenated words. I didn't think it was
  AY> worth the time to figure out how to handle that and single apostrophe
  AY> words at the same time. Besides, I'm not verifying the accuracy of
  AY> the text.

  AY> In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it
  AY> took 40 seconds and found 's and ' as words where the original did
  AY> not.

no wonder it took so long. you matched the null string between each pair of word boundaries. you need a +, not * there.

  AY> This is the way I understand it:

  AY> (??{code}) replaces the regex at the current pos() with the result
  AY> of the code block.

  AY> If the match ($^N) was not in the hash, then it would auto-vivify
  AY> the key, increment it, and return (?=), which is a positive lookahead
  AY> on nothing, which always succeeds, so we continue on.

  AY> If the match ($^N) is in the hash, then it increments the value and
  AY> returns (?!), which is a negative lookahead on nothing, which always
  AY> fails, so we force it to backtrack and try again.

i understand the boolean thing as i said previously. i was asking why you used it there. i see no reason if all you are doing is word counting.

  AY> Changing the regex to

  AY> 1 while $text =~ m{(
  AY>                     (\b\w+(?:['-]+\w+)*\b)
  AY>                     (?{!$unique{$^N}++})
  AY>                    )
  AY>                   }xg;

  AY> dropped the time down to 3s.

  >> since you just replace the word by itself, why use s///? m// will get the same results and should be much faster.

  AY> There was no appreciable difference between the two types of regexes
  AY> (see my code below).

try this:

$unique{$1}++ while $text =~ m/([\w'-]+)/g ;

use the benchmark module to compare the speeds. make sure you don't do destructive parsing, which some of your examples seem to do.

uri
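The suggested comparison could be set up roughly as follows. This is only a sketch: the sample text is a made-up stand-in for ./kjv10.txt, the two sub names are illustrative, and it assumes Perl 5.18 or later so that the lexical %count inside the (?{...}) block behaves as a normal closure. Both variants use m// only, so neither modifies the text.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Small stand-in for the real input; the thread used ./kjv10.txt.
my $text = join ' ', ("in the beginning was the well-known word") x 2000;

# Counting inside the regex with an embedded code block.
sub count_with_code_block {
    my %count;
    1 while $text =~ m{ (\b\w+(?:['-]+\w+)*\b) (?{ $count{$^N}++ }) }xg;
    return \%count;
}

# Counting in plain Perl with the simpler character class.
sub count_with_plain_loop {
    my %count;
    $count{$1}++ while $text =~ m/([\w'-]+)/g;
    return \%count;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    code_block => \&count_with_code_block,
    plain_loop => \&count_with_plain_loop,
} );
```

The numbers will vary by machine and perl version, so the table is only meaningful as a relative comparison on the same input.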