Thanks, Omar, for pointing out the 'j' flag in Saxon. Sounds enticing; I
think we can include it in BaseX as well.


Omar Siam <omar.s...@oeaw.ac.at> wrote on Wed, Aug 8, 2018, 12:58:

> Hi
>
> I think the problem is this: there are numerous implementations of
> regular expressions which share a common subset but differ in the more
> advanced features.
>
> Using the Java regular expression implementation you can use greedy and
> some other things. The XSL and XQuery implementation according to the
> standards does not allow this and so misinterprets the regular
> expression. See here:
> https://www.w3.org/TR/xpath-functions-31/#regex-syntax
>
> You can tell Saxon to use a different regexp engine such as the standard
> Java one:
> https://www.saxonica.com/html/documentation/functions/fn/matches.html
>
> Best regards
>
> Omar
>
>
> On 07.08.2018 at 21:38, Andreas Mixich wrote:
> > Hi
> >
> > [rfc3986](https://tools.ietf.org/html/rfc3986#appendix-B) defines a nice
> > regular expression, which groups any URI, including URN, by URI
> > component.
> >
> > What is interesting about this regex is the use of the '?' quantifier,
> > which makes every preceding group/component optional, thus matching
> > either a URI or any other(!) string, since anything that does not match
> > one of the special groups goes into a catch-all group (no. 5), which
> > keeps either the path or the full, arbitrary string. This is negligible,
> > since the input to this regex is guaranteed to be of the right type
> > (a/@href/string()).
> >
> > Here is the relevant part from the RFC.
> >
> >    Appendix B
> >
> >    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
> >           12            3  4          5       6  7        8 9
> >
> >       The numbers in the second line above are only to assist
> >       readability; they indicate the reference points for each
> >       subexpression (i.e., each paired parenthesis).  We refer to the
> >       value matched for subexpression <n> as $<n>.  For example, matching
> >       the above expression to
> >
> >          http://www.ics.uci.edu/pub/ietf/uri/#Related
> >
> >       results in the following subexpression matches:
> >
> >          $1 = http:
> >          $2 = http
> >          $3 = //www.ics.uci.edu
> >          $4 = www.ics.uci.edu
> >          $5 = /pub/ietf/uri/
> >          $6 = <undefined>
> >          $7 = <undefined>
> >          $8 = #Related
> >          $9 = Related
> >
> >       where <undefined> indicates that the component is not present,
> >       as is the case for the query component in the above example.
> >       Therefore, we can determine the value of the five components as
> >
> >          scheme    = $2
> >          authority = $4
> >          path      = $5
> >          query     = $7
> >          fragment  = $9
> >
> >       Going in the opposite direction, we can recreate a URI reference
> >       from its components by using the algorithm of Section 5.3.
> >
> >
> > I tested this regex with Saxon, eXist and BaseX. eXist successfully
> > parsed all the test cases I threw at it into the right groups; Saxon
> > and BaseX did not. The failure is:
> >
> >      [FORX0003] Pattern matches empty string..
> >
> > And that got me baffled, since all three processors use Java underneath
> > and since the definition of the '?' quantifier, when used like this,
> > seems to be:
> >
> >      Makes the preceding item optional. Greedy, so the optional item
> >      is included in the match if possible.
> >
> > Which means that *if* any of the group's contents match, they should be
> > included, rather than producing an empty string.
> >
> > Why is it like that? And what can I do about it? I found no other URI
> > parsing regex that componentizes this way and would be compatible with
> > XQuery.
> >
> > See the attached test case.
> >
>
>
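
The behaviour discussed in the thread can be reproduced outside of XQuery.
The following is a minimal sketch, not the test case attached to the
original mail. It uses Python's re module purely as a stand-in for a
backtracking engine such as the Java one; the names URI_PATTERN, split_uri
and matches_empty_string are made up for illustration. It shows the
component groups from RFC 3986, Appendix B, and that the pattern can match
a zero-length string because every group is optional. XPath Functions and
Operators 3.1 specifies that fn:replace and fn:tokenize raise FORX0003 for
such a pattern; that this is what triggered the error in Saxon and BaseX
is an assumption, since the failing function call is not shown in the
thread.

```python
import re

# Regular expression from RFC 3986, Appendix B, copied unchanged.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(reference):
    """Return the five RFC 3986 components; absent ones are None."""
    # Every group is optional, so match() succeeds for any input string.
    match = URI_PATTERN.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


def matches_empty_string(pattern):
    """Return True if the pattern can match a zero-length string."""
    return re.fullmatch(pattern, "") is not None


if __name__ == "__main__":
    print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
    print(matches_empty_string(URI_PATTERN.pattern))
```

For the example URI from the RFC, split_uri returns the scheme, authority,
path and fragment listed in Appendix B, with the query reported as None.
An arbitrary string such as "not a uri" ends up entirely in the path
component, which is the catch-all behaviour described in the original
question.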
