Thanks, Omar, for pointing out the 'j' flag in Saxon. Sounds enticing; I think we can include it in BaseX as well.

Omar Siam <omar.s...@oeaw.ac.at> wrote on Wed, 8 Aug 2018, 12:58:

> Hi
>
> I think the problem is: There are numerous implementations of regular
> expressions which have a common subset but differ in the more
> advanced features.
>
> Using the Java regular expression implementation you can use greedy
> matching and some other things. The XSL and XQuery implementation
> according to the standards does not allow this and so misinterprets the
> regular expression. See here:
> https://www.w3.org/TR/xpath-functions-31/#regex-syntax
>
> You can tell Saxon to use a different regexp engine such as the standard
> Java one:
> https://www.saxonica.com/html/documentation/functions/fn/matches.html
>
> Best regards
>
> Omar
>
>
> On 07.08.2018 at 21:38, Andreas Mixich wrote:
> > Hi
> >
> > [rfc3986](https://tools.ietf.org/html/rfc3986#appendix-B) defines a nice
> > regular expression, which groups any URI, including URN, by URI
> > component.
> >
> > What is interesting about this regex is the use of the '?' quantifier,
> > which makes every preceding group/component optional, thus matching
> > either a URI or any other(!) string, since anything that does not match
> > one of the special groups goes into a catch-all group (no. 5), which
> > keeps either the path or the full, arbitrary string. This is negligible,
> > since the input to this regex is guaranteed to be of the right type
> > (a/@href/string()).
> >
> > Here is the relevant part from the RFC.
> >
> >    Appendix B
> >
> >    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
> >     12            3  4          5       6  7        8 9
> >
> >    The numbers in the second line above are only to assist
> >    readability; they indicate the reference points for each
> >    subexpression (i.e., each paired parenthesis). We refer to the
> >    value matched for subexpression <n> as $<n>. For example, matching
> >    the above expression to
> >
> >       http://www.ics.uci.edu/pub/ietf/uri/#Related
> >
> >    results in the following subexpression matches:
> >
> >       $1 = http:
> >       $2 = http
> >       $3 = //www.ics.uci.edu
> >       $4 = www.ics.uci.edu
> >       $5 = /pub/ietf/uri/
> >       $6 = <undefined>
> >       $7 = <undefined>
> >       $8 = #Related
> >       $9 = Related
> >
> >    where <undefined> indicates that the component is not present,
> >    as is the case for the query component in the above example.
> >    Therefore, we can determine the value of the five components as
> >
> >       scheme    = $2
> >       authority = $4
> >       path      = $5
> >       query     = $7
> >       fragment  = $9
> >
> >    Going in the opposite direction, we can recreate a URI reference
> >    from its components by using the algorithm of Section 5.3.
> >
> >
> > I tested this regex with Saxon, eXist and BaseX. eXist successfully
> > parsed all the test cases I threw at it into the right groups; Saxon
> > and BaseX did not. The failure is:
> >
> >    [FORX0003] Pattern matches empty string.
> >
> > And that got me baffled, since all three processors use Java underneath
> > and since the definition of the '?' quantifier, when used like this,
> > seems to be:
> >
> >    Makes the preceding item optional. Greedy, so the optional item
> >    is included in the match if possible.
> >
> > Which means that *if* any of the group's contents match, they should be
> > included, rather than producing an empty string.
> >
> > Why is it like that? And what can I do about it? I found no other URI
> > parsing regex that componentizes this way and would be compatible with
> > XQuery.
> >
> > See the attached test case.
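
For anyone who wants to check the grouping outside an XQuery processor, here is a minimal sketch of the RFC 3986 Appendix B expression. It uses Python's re module purely as a stand-in engine (an assumption of this sketch; none of the processors discussed above use it), and the names URI_RE and split_uri are made up for illustration. It reproduces the component table from the RFC and also shows that the pattern as a whole matches the empty string. That property is what the FORX0003 message describes, though whether it is the only cause in the attached test case is not verified here.

```python
import re

# Regular expression from RFC 3986, Appendix B (copied unchanged).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(text):
    """Split text into the five RFC 3986 components.

    Hypothetical helper for illustration. None plays the role of
    <undefined> in the RFC, i.e. the component is not present.
    """
    # Every group is optional, so the pattern matches any input string.
    m = URI_RE.match(text)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)

# The same pattern also matches a zero-length string.
matches_empty = URI_RE.match("") is not None
print(matches_empty)
```

The query component comes back as None here, which corresponds to $7 = <undefined> in the RFC example.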