Re: Interesting little regex

2006-02-24 Thread Alan Young
 Yes, what are the unique occurrences of text in that string?  I've run the
 code and I'm still not exactly sure what it's supposed to do.

 use Data::Dump qw/ dump /;

 $a="abcde"x4;
 $a=~s{((\w+?)(??{!$b{$^N}++?"(?=)":"(?!)"}))}{($1)}xg;
 print "$a\n";
 print dump(\%b), "\n";

 (a)(b)(c)(d)(e)(ab)(cd)(ea)(bc)(de)(abc)de
 { a => 3, ab => 2, abc => 1, b => 2, bc => 1, c => 2, cd => 1, d => 3, de
 => 2, e => 3, ea => 1 }

There is 1 unique occurrence of 'abc', 2 occurrences of 'ab' (not
contained in the other occurrence of 'abc'), and 3 occurrences of 'a'
(not contained in the other occurrences of 'abc' and 'ab').
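For readers puzzling over what the substitution computes, here is a rough plain-Perl equivalent, offered as a sketch rather than as the original author's method. At each position it takes the shortest run of word characters not seen before and counts every candidate it tries on the way. The name shortest_unseen_chunks is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): mimic the regex by hand.
# At each position, try ever longer candidates until one is new.
sub shortest_unseen_chunks {
    my ($str) = @_;
    my (%seen, @chunks);
    my $pos = 0;
    while ($pos < length $str) {
        my $taken = 0;
        for my $len (1 .. length($str) - $pos) {
            my $candidate = substr $str, $pos, $len;
            last if $candidate =~ /\W/;    # \w+? never crosses a non-word character
            if (!$seen{$candidate}++) {    # first sighting: accept this chunk
                push @chunks, $candidate;
                $taken = $len;
                last;
            }
        }
        $pos += $taken || 1;               # nothing new here: move on one character
    }
    return (\@chunks, \%seen);
}

my ($chunks, $seen) = shortest_unseen_chunks('abcde' x 4);
print join('', map { "($_)" } @$chunks), "\n";
print "$_ => $seen->{$_}\n" for sort keys %$seen;
```

The counts in %$seen include the failed attempts, which is why 'a' ends up at 3 even though it is only accepted once.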

If you have a stream of text of variable size with delimiters of
varying length embedded in the string and values of varying length:

delim1abcdefgdel2hijklmnopqrstd3uvwxyz

(where we have delim1, del2 and d3 as delimiters, and abcdefg,
hijklmnopqrst and uvwxyz as values)

how can we get those out efficiently and without a lot of programming?
 That regex does that.

The !$b{$^N}++ portion of the regex is replaced in live code with a
subroutine call that returns true or false based on whether we have a
recognized delimiter yet, and the regex is slightly different so that
we can capture the value as well.
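As a sketch of the delimiter case only: this is not the live code described above (which calls a subroutine from inside the regex). It assumes the delimiter set is known up front and that the whole string is in memory. The names @delims and split_on_delims are made up for illustration.

```perl
use strict;
use warnings;

# Assumed delimiter set, taken from the example string in the post.
my @delims = qw(delim1 del2 d3);

# Longest first, so the alternation prefers the longer of two candidates.
my $delim_re = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @delims;

# Hypothetical helper: return [delimiter, value] pairs from a string.
sub split_on_delims {
    my ($stream) = @_;
    my @pairs;
    # A delimiter, then as little as possible up to the next delimiter or the end.
    while ($stream =~ /($delim_re)(.*?)(?=$delim_re|\z)/gs) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

print "$_->[0] => $_->[1]\n"
    for split_on_delims('delim1abcdefgdel2hijklmnopqrstd3uvwxyz');
```

A value that happens to contain a delimiter string would be cut short here, so this only works when the data cannot look like a delimiter.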

I just thought it was a neat little regex that qualified as a FWP and
passed it along to my perl monger and user groups as well.
--
Alan


Re: Interesting little regex

2006-02-24 Thread Ronald J Kimball
On Fri, Feb 24, 2006 at 01:40:15PM -0700, Alan Young wrote:
  Yes, what are the unique occurrences of text in that string?  I've run the
  code and I'm still not exactly sure what it's supposed to do.
 
  use Data::Dump qw/ dump /;
 
  $a="abcde"x4;
  $a=~s{((\w+?)(??{!$b{$^N}++?"(?=)":"(?!)"}))}{($1)}xg;
  print "$a\n";
  print dump(\%b), "\n";
 
  (a)(b)(c)(d)(e)(ab)(cd)(ea)(bc)(de)(abc)de
  { a => 3, ab => 2, abc => 1, b => 2, bc => 1, c => 2, cd => 1, d => 3, de
  => 2, e => 3, ea => 1 }
 
 There is 1 unique occurrence of 'abc', 2 occurrences of 'ab' (not
 contained in the other occurrence of 'abc'), and 3 occurrences of 'a'
 (not contained in the other occurrences of 'abc' and 'ab').

I'm afraid I'm not getting what you mean by unique occurrence...  Why is
there only one unique occurrence of 'abc', when the string contains 'abc'
four times?  Why are there two unique occurrences of 'de', but only one of
'bc'?  Why are there no unique occurrences at all of 'abcd'?

 I just thought it was a neat little regex that qualified as a FWP and
 passed it along to my perl monger and user groups as well.

It is a neat regex, that's why I want to understand what it's for.  :)

Ronald


Re: Interesting little regex

2006-02-24 Thread Alan Young
 I'm afraid I'm not getting what you mean by unique occurrence...  Why is
 there only one unique occurrence of 'abc', when the string contains 'abc'
 four times?  Why are there two unique occurrences of 'de', but only one of
 'bc'?  Why are there no unique occurrences at all of 'abcd'?

I'm probably not stating myself well (I'm known for that).  Maybe
unique occurrence isn't what I'm really trying to say.

Say we have a stream of text: either a file that is several tens of
millions of bytes in size, more than we can load into memory at a
time, or data we're receiving over a connection of some kind (e.g.,
serial or TCP).  The stream contains a varying number of delimiters
of varying length (delimiter1, delim2, del3).  The value of a
delimiter is the delimiter itself plus an unspecified number of
bytes, up to the next known delimiter.  (The value of delimiter
'del2' in the string 'del1abcdel2def' is 'del2def'.)

I don't understand exactly why this format was decided upon; this was
the poser handed to my co-worker and this is what he came up with as a
proof of concept.  Of course, this requires that no delimiter can be
a substring of another.
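The substring constraint is easy to check before any parsing starts. The following is a minimal sketch; overlapping_delims is a made-up name and the extra 'del' delimiter is only there to trigger a clash.

```perl
use strict;
use warnings;

# Hypothetical helper: list every pair where one delimiter occurs inside another.
sub overlapping_delims {
    my @delims = @_;
    my @clashes;
    for my $inner (@delims) {
        for my $outer (@delims) {
            next if $inner eq $outer;
            # index() returns -1 when $inner does not occur in $outer.
            push @clashes, [$inner, $outer] if index($outer, $inner) >= 0;
        }
    }
    return @clashes;
}

for my $clash (overlapping_delims(qw(delimiter1 delim2 del3 del))) {
    print "'$clash->[0]' is a substring of '$clash->[1]'\n";
}
```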

Better?
--
Alan


Re: Interesting little regex

2006-02-24 Thread Ronald J Kimball
On Fri, Feb 24, 2006 at 03:20:27PM -0700, Alan Young wrote:
  I'm afraid I'm not getting what you mean by unique occurrence...  Why is
  there only one unique occurrence of 'abc', when the string contains 'abc'
  four times?  Why are there two unique occurrences of 'de', but only one of
  'bc'?  Why are there no unique occurrences at all of 'abcd'?
 
 I'm probably not stating myself well (I'm known for that).  Maybe
 unique occurrence isn't what I'm really trying to say.
 
 Say we have a stream of text: either a file that is several tens of
 millions of bytes in size, more than we can load into memory at a
 time, or data we're receiving over a connection of some kind (e.g.,
 serial or TCP).  The stream contains a varying number of delimiters
 of varying length (delimiter1, delim2, del3).  The value of a
 delimiter is the delimiter itself plus an unspecified number of
 bytes, up to the next known delimiter.  (The value of delimiter
 'del2' in the string 'del1abcdel2def' is 'del2def'.)
 
 I don't understand exactly why this format was decided upon; this was
 the poser handed to my co-worker and this is what he came up with as a
 proof of concept.  Of course, this requires that no delimiter can be
 a substring of another.
 
 Better?

It's making more sense now.  Thank you for taking the time to explain it to
me!

Ronald


Re: Interesting little regex

2006-02-23 Thread Uri Guttman
 AY == Alan Young [EMAIL PROTECTED] writes:

  AY I know, replying to myself.
  AY Parsing the KJV Bible took about 7 seconds with this:

  AY #!/usr/bin/perl -w

  AY use strict;

  AY my $text = do {
  AY   open my $T, './kjv10.txt' or die "Couldn't open kjv10.txt: $!\n";
  AY   local $/;
  AY   <$T>;
  AY };

use File::Slurp ;

my $text = read_file( 'bibble' ) ;

much faster that way.

  AY my %unique;

  AY $text =~ s{(
  AY  (\b\w+(?:['-]+\w+)*\b)

why the multiple ['-] inside the words? could those chars ever begin or
end words? so just [\w'-]+ should be fine there.

  AY  (??{!$unique{$^N}++?"(?=)":"(?!)"})

i am not sure why you do that boolean trick there. i have seen it before
(and actually use it somewhere) but what is its purpose here?

  AY    )
  AY   }{
  AY     $1

since you just replace the word by itself, why use s///? m// will get
the same results and should be much faster.

  AY   }xg;

  AY print "$_ => $unique{$_}\n" for sort keys %unique;

if you want raw speed, that makes lots of calls to print which is very
slow as it needs to invoke stdio code for each call. this should be
faster (even with the ram usage):

print map "$_ => $unique{$_}\n", sort keys %unique;

i am curious how much faster it will run with all those changes. :)

uri

-- 
Uri Guttman  --  [EMAIL PROTECTED]   http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs    http://jobs.perl.org


Re: Interesting little regex

2006-02-23 Thread Jerrad Pierce
' can end words in English; the most obvious being possessive plurals,
though it can also be used for some contractions as well.
-- 
H4sICNoBwDoAA3NpZwA9jbsNwDAIRHumuC4NklvXTOD0KSJEnwU8fHz4Q8M9i3sGzkS7BBrm
OkCTwsycb4S3DloZuMIYeXpLFqw5LaMhXC2ymhreVXNWMw9YGuAYdfmAbwomoPSyFJuFn2x8
Opr8bBBidcc=
--
MOTD on Prickle-Prickle, the 54th of Chaos, in the YOLD 3172:
The man who makes no mistakes does not usually make anything.


Re: Interesting little regex

2006-02-23 Thread Bart Lateur
On Thu, 23 Feb 2006 13:02:32 -0500, Uri Guttman wrote:

  AY $text =~ s{(
  AY  (\b\w+(?:['-]+\w+)*\b)

why the multiple ['-] inside the words? could those chars ever begin or
end words? so just [\w'-]+ should be fine there.

That reminds me, only earlier today I looked at the word frequency
counter code in perlfaq6.


http://perldoc.perl.org/perlfaq6.html#How-can-I-print-out-a-word-frequency-or-line-frequency-summary%3f

I'm a bit puzzled by the comment:

while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
    $seen{$1}++;
}

I'm wondering why they do it this way...

-- 
Bart.


Re: Interesting little regex

2006-02-23 Thread Alan Young
Updated script at bottom.
On 2/23/06, Uri Guttman [EMAIL PROTECTED] wrote:
   AY $text =~ s{(
   AY  (\b\w+(?:['-]+\w+)*\b)

 why the multiple ['-] inside the words? could those chars ever begin or
 end words? so just [\w'-]+ should be fine there.

It's possible to have multi-hyphenated words.  I didn't think it was
worth the time to figure out how to handle that and single apostrophe
words at the same time.  Besides, I'm not verifying the accuracy of
the text.

In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it
took 40 seconds and found 's and ' as words where the original did
not.
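To make the difference between the two word patterns discussed above concrete, here is a small sketch on a made-up sample string (not the KJV text). The variable names are illustrative only.

```perl
use strict;
use warnings;

my $sample = "the sheep's fold -- brothers' keeper";

# Pattern from the original script: apostrophes and hyphens only between word characters.
my @strict = $sample =~ /\b\w+(?:['-]+\w+)*\b/g;

# Suggested simplification: any run of word characters, apostrophes and hyphens.
my @loose  = $sample =~ /[\w'-]+/g;

print "strict: ", join('|', @strict), "\n";   # the|sheep's|fold|brothers|keeper
print "loose:  ", join('|', @loose),  "\n";   # the|sheep's|fold|--|brothers'|keeper
```

The looser class keeps the trailing apostrophe of a possessive plural, but it also counts a bare "--" as a word.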

   AY  (??{!$unique{$^N}++?"(?=)":"(?!)"})

 i am not sure why you do that boolean trick there. i have seen it before
 (and actually use it somewhere) but what is its purpose here?

Well, as we were looking at it, we realized it wasn't really necessary
for the word parsing.  What it was originally doing, however, was
finding the unique occurrences in a string of text.

Basically, if the match was not in the hash then (?=) would force the
regex to succeed, otherwise it would force it to fail.

This is the way I understand it:

(??{code}) replaces the regex at the current pos() with the result
of the code block.

If the match ($^N) was not in the hash, then it would auto-vivify
the key, increment it and return (?=), which is a positive lookahead
on nothing, which always succeeds so we continue on.

If the match ($^N) is in the hash, then it increments the value and
returns (?!), which is a negative lookahead on nothing, which always
fails so we force it to backtrack and try again.

I'm still wrapping my brain around this concept so I may have it
twisted a little.
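A small sketch of the two empty lookaheads and of the first-sighting trick built on them may help here. It uses single-letter tokens so that backtracking into shorter substrings does not muddy the picture; %seen and @first are illustrative names, not from the original script.

```perl
use strict;
use warnings;

# "(?=)" is an empty positive lookahead and succeeds anywhere;
# "(?!)" is an empty negative lookahead and never succeeds.
print "(?=) matched\n"       if     'abc' =~ /a(?=)b/;
print "(?!) did not match\n" unless 'abc' =~ /a(?!)b/;

# Package variable so the code block inside the regex can see it.
our %seen;

# (??{ ... }) runs the code and uses the returned string as a sub-pattern:
# a token seen for the first time gets "(?=)", a repeat gets "(?!)".
my @first = 'a b a c b' =~ /(\w+)(??{ !$seen{$^N}++ ? '(?=)' : '(?!)' })/g;

print "first sightings: @first\n";
print "$_ => $seen{$_}\n" for sort keys %seen;
```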

Changing the regex to

  1 while $text =~ m{(
     (\b\w+(?:['-]+\w+)*\b)
     (?{!$unique{$^N}++})
   )
  }xg;

dropped the time down to 3s.

 since you just replace the word by itself, why use s///? m// will get
 the same results and should be much faster.

There was no appreciable difference between the two types of regexes
(see my code below).

   AY print "$_ => $unique{$_}\n" for sort keys %unique;

 if you want raw speed, that makes lots of calls to print which is very
 slow as it needs to invoke stdio code for each call. this should be
 faster (even with the ram usage):

 print map "$_ => $unique{$_}\n", sort keys %unique;

Didn't seem to make a difference, but I like this way better.  Seems
more perlish.

Before changing the regex as indicated where I explained how we didn't
really need to do it that way :/, and with your other changes the
speed was still right around 7s (using time ./simple.pl).  However,
memory usage was noticeably (if not significantly) improved.

#!/usr/bin/perl -w

use strict;

use File::Slurp;

my $text = read_file( './kjv10.txt' );

my %unique;

if ( 0 ) {
  print "substitution\n";

#  $text =~ s{(
#     (\b\w+(?:['-]+\w+)*\b)
#     (??{!$unique{$^N}++?"(?=)":"(?!)"})
#   )
#  }{}xg;

  $text =~ s{(
     (\b\w+(?:['-]+\w+)*\b)
     (?{$unique{$^N}++})
   )
  }{}xg;

} else {

  print "while loop\n";

#  1 while $text =~ m{(
#     (\b\w+(?:['-]+\w+)*\b)
#     (??{!$unique{$^N}++?"(?=)":"(?!)"})
#   )
#  }xg;

  1 while $text =~ m{(
     (\b\w+(?:['-]+\w+)*\b)
     (?{!$unique{$^N}++})
   )
  }xg;

}

print map "$_ => $unique{$_}\n", sort keys %unique;
--
Alan


Re: Interesting little regex

2006-02-23 Thread Uri Guttman
 AY == Alan Young [EMAIL PROTECTED] writes:

  AY Updated script at bottom.
  AY On 2/23/06, Uri Guttman [EMAIL PROTECTED] wrote:
  AY $text =~ s{(
  AY (\b\w+(?:['-]+\w+)*\b)
   
   why the multiple ['-] inside the words? could those chars ever begin or
   end words? so just [\w'-]+ should be fine there.

  AY It's possible to have multi-hyphenated words.  I didn't think it was
  AY worth the time to figure out how to handle that and single apostrophe
  AY words at the same time.  Besides, I'm not verifying the accuracy of
  AY the text.

  AY In the spirit of testing though, I changed it to (\b[\w'-]*\b) and it
  AY took 40 seconds and found 's and ' as words where the original did
  AY not.

no wonder it took so long. you matched the null string between each pair
of word boundaries. you need a +, not * there.
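The effect of * versus + can be seen on a short made-up string. This is only a sketch; the exact slowdown on a large text will depend on the input.

```perl
use strict;
use warnings;

my $sample = "in the beginning";

my @star = $sample =~ /(\b[\w'-]*\b)/g;   # may match zero characters at a boundary
my @plus = $sample =~ /(\b[\w'-]+\b)/g;   # needs at least one character

# grep in scalar context returns the number of matching elements.
my $empty = grep { $_ eq '' } @star;

printf "star: %d matches (%d empty), plus: %d matches\n",
    scalar @star, $empty, scalar @plus;
```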


  AY This is the way I understand it:

  AY (??{code}) replaces the regex at the current pos() with the result
  AY of the code block.

  AY If the match ($^N) was not in the hash, then it would auto-vivify
  AY the key, increment it and return (?=), which is a positive lookahead
  AY on nothing, which always succeeds so we continue on.

  AY If the match ($^N) is in the hash, then it increments the value and
  AY returns (?!), which is a negative lookahead on nothing, which always
  AY fails so we force it to backtrack and try again.

i understand the boolean thing as i said previously. i was asking why
you used it there. i see no reason if all you are doing is word
counting. 

  AY Changing the regex to

  AY   1 while $text =~ m{(
  AY      (\b\w+(?:['-]+\w+)*\b)
  AY      (?{!$unique{$^N}++})
  AY    )
  AY   }xg;

  AY dropped the time down to 3s.

   since you just replace the word by itself, why use s///? m// will get
   the same results and should be much faster.

  AY There was no appreciable difference between the two types of regexes
  AY (see my code below).

try this:

$unique{$1}++ while $text =~ m/([\w'-]+)/g ;

use the benchmark module to compare the speeds. make sure you don't do
destructive parsing which some of your examples seem to do.
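A comparison along those lines could be sketched with the core Benchmark module as below. The stand-in text, the sub names and the %count variable are assumptions for illustration; the thread itself used kjv10.txt, and no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Stand-in text; kjv10.txt from the thread is not assumed to be present.
my $text = join ' ',
    ("in the beginning God created the heaven and the earth") x 2000;

our %count;   # package variable so the (?{ ... }) block can see it

# Plain capture-and-count loop.
sub count_simple {
    local %count;
    $count{$1}++ while $text =~ m/([\w'-]+)/g;
    return { %count };
}

# Counting from inside the regex with an embedded code block.
sub count_code_block {
    local %count;
    1 while $text =~ m/(\b\w+(?:['-]+\w+)*\b)(?{ $count{$^N}++ })/g;
    return { %count };
}

# Neither sub modifies $text, so every iteration sees the same input.
# A negative count means "run each for at least that many CPU seconds".
cmpthese(-1, {
    simple     => \&count_simple,
    code_block => \&count_code_block,
});
```

Both subs use m//g rather than s///, which keeps the parse non-destructive and the runs comparable.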

uri
