Re: [MarkLogic Dev General] Regular expression bug

Chris Maloney Fri, 28 May 2010 07:42:54 -0700

On Thu, May 27, 2010 at 10:23 PM, Danny Sokolsky <[email protected]> wrote:


>   Sorry this one slipped through the cracks for a while...better late than
>   never.

NP!  Thanks for looking at it.

>   We think this is not actually a bug, even though it appears so at first
>   glance.

Hmm...

>   The reason is that the specification is vague and leaves certain
>   details to the implementation. It says that ungreedy/reluctant quantifiers
>   are required to match the shortest possible substring, but it does not give
>   rules for the priority of sub-expressions/capturing groups.

This doesn't quite make sense.  I read the specification, and looked at my
example again.  It has nothing to do with "priority of sub-expressions/capturing
groups".  It has only to do with the scope of the '?'.  The spec says "Reluctant
quantifiers are supported. They are indicated by a '?' following a quantifier."
I read this to indicate that the '?' should apply to the quantifier that
precedes it.  It seems very clear.  If a quantifier isn't followed by the '?',
then it should not be "reluctant".
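
To illustrate the scoping point, here is a minimal sketch in Python, whose `re`
module uses Perl-style backtracking semantics.  It says nothing about how
MarkLogic behaves; the pattern and input string are made up for illustration
and are not the example from my original report.

```python
import re

# Hypothetical input, chosen only to make the group boundaries visible.
text = "aaa"

# '+?' is reluctant; the following '*' has no '?', so it stays greedy.
reluctant_then_greedy = re.match(r"(a+?)(a*)", text).groups()

# With no '?' at all, the first group is greedy and takes everything.
greedy_then_greedy = re.match(r"(a+)(a*)", text).groups()

print(reluctant_then_greedy)  # ('a', 'aa')
print(greedy_then_greedy)     # ('aaa', '')
```

In a Perl-style engine the '?' changes only the quantifier it directly follows,
which is the reading of the spec I am arguing for.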

>   So, it's up to
>   the implementation. If you want to read some gory details, check out
>   http://www.w3.org/TR/xpath-functions/#string.match.

>   POSIX does define such rules, but it doesn't have the notion of ungreedy
>   quantifiers. Perl doesn't try to define the rules; there is no such thing
>   as a Perl specification.

The Perl implementation is the Perl specification.

>   The closest thing is a description of the
>   implementation, which is an inherently low-performance approach involving
>   trying one match at a time (i.e. backtracking). This is not a great
>   approach.

Perl is open source, and also very well-performing.  You could open it up and
take a look (but granted, I haven't done so -- I don't know if it would be a
can of worms or not).

>   So the MarkLogic implementation chose the more performant approach.

In my tests so far, MarkLogic is about two or three times slower than Perl on
regular expressions.  But I haven't polished the tests, and I might not be
comparing "apples to apples" yet.  I'll let you know.

>   In the 1.0-ml dialect, however, there is an undocumented “p” flag to the
>   functions that take a regex that does the perl-like matching (it is an
>   extension to the spec, so it is not available in the 1.0 dialect).

Why isn't it documented?  It seems like something that should be.

>   I think
>   your workaround is a better approach, however.

Cheers!

