subject:"Re\: RFC 72 \(v3\) Variable\-length lookbehind\: the regexp engine should also go backward."

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-17 Thread Hugo


mike mulligan writes:
:From: Hugo [EMAIL PROTECTED]
:Sent: Tuesday, September 12, 2000 2:54 PM
:
: 3. The regexp is matched left to right: first the lookbehind, then 'X',
: then '[yz]'.
:
:Thanks for the insight - I was stuck in my bad assumption that the optimized
:behavior was the only behavior.
:
:What I am not sure of is whether the "optimization" is ever dangerous.  In
:other words, is there ever a difference in end-result between, doing at each
:point: 1. test look-behind and then test the remainder of the regex, vs 2.
:test the remainder of the regex, and then test the look-behind?

Sometimes it may not be possible at all:
  "axbcxd" =~ /(?= a(.)b ) c\1d/x;

:I am without a motiviating example, but can anyone see utility in a
:non-greedy look-behind that operates in sense "2" above?   Syntax:
:(?=pat)?(?!pat)?Currently, a question-mark like this on a
:look-behind makes it optional, defeating the assertion's purpose.  If anyone
:has a good example, I'll take on writing a RFC.

Currently, a question mark like this on a lookbehind is apparently
ignored:
  crypt% ./perl -wle '/(?=test)?/'
  Quantifier unexpected on zero-length expression before HERE mark in regex 
m/(?=test)?  HERE / at -e line 1.
  Use of uninitialized value in pattern match (m//) at -e line 1.
  crypt% 

.. but I don't know why, since it could arguably be useful:
  / (?= (+|-) )? \d+ /x;
  print defined($1) ? "sign: '$1'\n" : "no sign\n";

Note that you can rewrite /(?=[aeiou])X[yz]/ as /X[yz](?=[aeiou]..)/
if you really want ...

Hugo

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-13 Thread Peter Heslin


On Tue, Sep 12, 2000 at 05:16:17AM +0100, Hugo wrote:
 :Simply put, I want variable-length lookbehind.
 
 The difficulty with variable-length lookbehind (let's call it
 VLLB) is this: suppose that we want to match "abcdef...xyz" =~
 /(?=x+)y/. In theory, to check the possible /x+/ matches in
 the right order [0] we need to check whether there we can match
 0 characters at offset 0 (no), then check whether we can match
 1 character at offset 0 or 0 characters from offset 1 (no), then
 check whether we can match 2 characters from offset 0, or 1 from
 1, or 0 from 2 ... and so it goes on. So we'd be trying a complete
 subpattern, of arbitrary complexity, against O(n^2) strings.
 
 Now in practice it isn't quite so bad - in the above example we
 could expect the optimiser to decide that there is a fixed
 substring 'y' at the start of the match, so we'd jump straight
 to an offset where we found that, then do only O(n) matches of
 the possible substrings preceding it. But that only kicks in
 where the optimiser finds something good to work with.

This is precisely why I would prefer the syntax I proposed (to which
Bart Lateur quite reasonably objected in an earlier post):  for me, an
important part of this proposal (and not merely an incidental one) is to
let the programmer (and not the optimiser) determine at what point to
make the lookbehind happen.

Bart pointed out that it would be more consistent to use the type of
syntax you have also (unconsciously?) used:

"abcdef...xyz" =~ /(?=x+)y/

rather than the syntax in the RFC, which might be:

"abcdef...xyz" =~ /y(?`=x+)/

It is true that my syntax is weirder, in that the lookbehind comes in
the middle, or even at the end of the regexp, but it has the advantage
of letting the programmer decide when the lookbehind should happen.
This is an advantage rather than a drawback because:

1) it means that the regexp engine *must*, by definition, match the
`y' in this example before it attempts the lookbehind -- thus the
O(n^2) behavior is avoided (or at least its avoidance is in the hands of
the programmer).

2) It frees the regexp code from the responsibility of doing any special
optimizations -- just match the part of the regexp that comes before the
lookbehind as usual, with normal optimizations, and only then is it
allowed to try matching backwards into the target string, starting
from the offset of the current prospective match.

3) The idea here is not to push work away from the optimizer and onto
the programmer (although that would be a nice side-effect for
implementability) -- rather, the point is to assume that the programmer
is using VLLB for a reason, and that that reason is because she has some
insight into the nature of the target data and *wants* to tell Perl to
jump back *here* rather than *there*, and then to continue forwards
with the match ...

I realize that the syntax is weird, but my feeling was that this
approach would be easier to implement and more useful, since on the one
hand the programmer would be doing the optimization, rather than Perl,
and on the other this added power to determine some part of how the
match is found would be desireable as an advanced feature (acknowledging
that it would also give the programmer the power write a really, really
inefficient regexp, if she so chose).

Incidentally, this is why I proposed new syntax, rather than extending
the old (?= ... ) syntax, as Mark Dominus suggested:  this feature
would work in a somewhat different way to the old lookbehind, and I
thought it might be good to make it look different.

If in any case the regexp code is unlikely to be rewritten from ground
up, as Mark says in another post, then there may be little chance of
this feature being implemented; nor does there seem to be the
tremendous groundswell of enthusiasm here for this functionality that I
had eagerly anticipated.  I'll make a pitch for it anyway at the end of
my talk at YAPC::Europe, and then I'll freeze the RFC.  

Thanks for the input,

Peter

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-13 Thread Robert Mathews


Hugo wrote:
 The difficulty with variable-length lookbehind (let's call it
 VLLB) is this: suppose that we want to match "abcdef...xyz" =~
 /(?=x+)y/. In theory, to check the possible /x+/ matches in
 the right order [0] we need to check whether there we can match
 0 characters at offset 0 (no), then check whether we can match
 1 character at offset 0 or 0 characters from offset 1 (no), then
 check whether we can match 2 characters from offset 0, or 1 from
 1, or 0 from 2 ... and so it goes on. So we'd be trying a complete
 subpattern, of arbitrary complexity, against O(n^2) strings.

You're assuming that the regex engine will always match things forward,
but the subject of this thread says it should be able to go backward. 
You're interpreting (?=ab*) to mean "try to match /ab*$/ against what
we've already scanned".  But what if it meant "try to match /^b*a/
against reverse(what we've already scanned)"?

Notice, the pattern is reversed too.  There are probably some gotchas
associated with reversing patterns.  Such as $ matching \n at
end-of-string, while ^ doesn't match \n at the beginning.  I'll be
optimistic and hope these are solvable.

The implementation could do the reverse match by making the regex engine
scan the string in reverse, as was suggested.  That seems like a huge,
and possibly quite inefficient, change to the code, as several people
have pointed out.  Alternatively, it might be possible to actually make
a reversed copy of the string (once, before starting the match at all)
and do the reverse match against that.  This should have negligible
impact on code that doesn't use variable length lookbehind.  It probably
wouldn't play too well with the "incremental pattern matching" proposals
that are being tossed around, though.  Ah, well.

Doing the matching in reverse would have some subtle effects.  For
example, in (?=([ab]+)([bc]+)), the second greedy match would gobble up
b's before the first one saw them.  I don't see that this is a problem,
as long as this behavior is documented.

What about nesting lookbehinds inside of lookbehinds?  Ouch...

-- 
Robert Mathews
Software Engineer
Excite@Home

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-12 Thread mike mulligan


From: Hugo [EMAIL PROTECTED]
Sent: Monday, September 11, 2000 11:59 PM


 mike mulligan replied to Peter Heslin:
 : ... it is greedy in the sense of the forward matching "*" or "+"
constructs.
 : [snip]

 This is nothing to do with greediness and everything to do with
 left-to-rightness. The regexp engine does not look for x* except
 in those positions where the lookbehind has already matched.

I was trying to understand at what point the lookbehind was attempted, and
confused myself and posted a bad example.  My apologies to everyone.  Let's
see if I can make sense of it on a second try.

My question is: if I have the regex  /(?=[aeiou]X[yz]+/  then does Perl: 1.
scan first for 'X', test the lookbehind, and then test the '[yz]',  or 2.
scan for 'X[yz]' and then test the lookbehind?

I am expecting these two alternatives to give the same result, but certain
test strings might run slower or faster depending on the approach.

Running perl -Dr shows that alternative 1 is used:

qq(aXuhXvoXyz) =~ /(?=[aeiou])X[yz]/

Guessing start of match, REx `(?=[aeiou])X[yz]' against `aXuhXvoXyz'...
Found anchored substr `X' at offset 1...
Guessed: match at offset 1
Matching REx `(?=[aeiou])X[yz]' against `XuhXvoXyz'
  Setting an EVAL scope, savestack=3
   1 a XuhXvoXyz  |  1:  IFMATCH[-1]
   0  aXuhXvoXyz  |  3:ANYOF[aeiou]
   1 a XuhXvoXyz  | 12:SUCCEED
  could match...
   1 a XuhXvoXyz  | 14:  EXACT X
   2 aX uhXvoXyz  | 16:  ANYOF[yz]
failed...
  Setting an EVAL scope, savestack=3
   4 aXuh XvoXyz  |  1:  IFMATCH[-1]
   3 aXu hXvoXyz  |  3:ANYOF[aeiou]
  failed...
failed...
  Setting an EVAL scope, savestack=3
   7 aXuhXvo Xyz  |  1:  IFMATCH[-1]
   6 aXuhXv oXyz  |  3:ANYOF[aeiou]
   7 aXuhXvo Xyz  | 12:SUCCEED
  could match...
   7 aXuhXvo Xyz  | 14:  EXACT X
   8 aXuhXvoX yz  | 16:  ANYOF[yz]
   9 aXuhXvoXy z  | 25:  END
Match successful!

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-12 Thread Hugo


In 085601c01cc8$2c94f390$[EMAIL PROTECTED], "mike mulligan" w
rites:
:From: Hugo [EMAIL PROTECTED]
:Sent: Monday, September 11, 2000 11:59 PM
:
:
: mike mulligan replied to Peter Heslin:
: : ... it is greedy in the sense of the forward matching "*" or "+"
:constructs.
: : [snip]
:
: This is nothing to do with greediness and everything to do with
: left-to-rightness. The regexp engine does not look for x* except
: in those positions where the lookbehind has already matched.
:
:I was trying to understand at what point the lookbehind was attempted, and
:confused myself and posted a bad example.  My apologies to everyone.  Let's
:see if I can make sense of it on a second try.
:
:My question is: if I have the regex  /(?=[aeiou]X[yz]+/  then does Perl: 1.
:scan first for 'X', test the lookbehind, and then test the '[yz]',  or 2.
:scan for 'X[yz]' and then test the lookbehind?

3. The regexp is matched left to right: first the lookbehind, then 'X',
then '[yz]'.

:I am expecting these two alternatives to give the same result, but certain
:test strings might run slower or faster depending on the approach.
:
:Running perl -Dr shows that alternative 1 is used:

Running perl -Dr shows that alternative 3 is used. However the -Dr data
is confused by the optimiser, which happens to have chosen the fixed
string 'X' as something worth searching for first. So the optimiser
permits the main matching engine to look only at those positions where
there is an 'X' immediately following.

I've annotated the -Dr output below to try and clarify. Note that if
you replace 'X' with '(x|X)', no optimisations take place (other than
a 'minimum length' check) and -Dr will give a much clearer picture of
the flow; again, if you replace 'X[yz]' with '(x|X)y' the optimiser
will now pick 'y' as the most significant thing worth searching for.

Hope this helps,

Hugo
---
:qq(aXuhXvoXyz) =~ /(?=[aeiou])X[yz]/
:
:Guessing start of match, REx `(?=[aeiou])X[yz]' against `aXuhXvoXyz'...

The optimiser is entered.

:Found anchored substr `X' at offset 1...

This is what the optimiser is looking for.

:Guessed: match at offset 1

This is what the optimiser found.

:Matching REx `(?=[aeiou])X[yz]' against `XuhXvoXyz'

The real matcher is entered.

:  Setting an EVAL scope, savestack=3
:   1 a XuhXvoXyz  |  1:  IFMATCH[-1]
:   0  aXuhXvoXyz  |  3:ANYOF[aeiou]

Checking lookbehind ...

:   1 a XuhXvoXyz  | 12:SUCCEED

Ok.

:  could match...
:   1 a XuhXvoXyz  | 14:  EXACT X

Checking 'X' ...

:   2 aX uhXvoXyz  | 16:  ANYOF[yz]

Checking '[yz]' ...

:failed...

Failed: try the next position permitted by the optimiser.

:  Setting an EVAL scope, savestack=3
:   4 aXuh XvoXyz  |  1:  IFMATCH[-1]
:   3 aXu hXvoXyz  |  3:ANYOF[aeiou]

Checking lookbehind ...

:  failed...

Failed.

:failed...

Try the next position permitted by the optimiser.

:  Setting an EVAL scope, savestack=3
:   7 aXuhXvo Xyz  |  1:  IFMATCH[-1]
:   6 aXuhXv oXyz  |  3:ANYOF[aeiou]

Checking lookbehind ...

:   7 aXuhXvo Xyz  | 12:SUCCEED

Ok.

:  could match...
:   7 aXuhXvo Xyz  | 14:  EXACT X

Checking 'X' ...

:   8 aXuhXvoX yz  | 16:  ANYOF[yz]

Checking '[yz]' ...

:   9 aXuhXvoXy z  | 25:  END
:Match successful!

Match successful.

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

5 matches

Site Navigation

Mail list logo

Footer information