Re: [Rd] Bug in perl=TRUE regexp matching?

Tomas Kalibera Mon, 31 Jul 2023 04:02:25 -0700


On 7/25/23 03:13, Brodie Gaslam via R-devel wrote:

On 7/24/23 4:10 AM, Duncan Murdoch wrote:
On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:
On 7/23/23 4:29 PM, Duncan Murdoch wrote:
The help page for `?gsub` says (in the context of performance
considerations):


"... just one UTF-8 string will force all the matching to be done in
Unicode"
It's been a little while since I looked at the code but IIRC this just
means that strings are converted to UTF-8 before matching. The problem
here seems to be more about the interpretation of the "\\w+" token by
PCRE.  I think this makes it a little clearer what's going on:

      gsub("\\w", "a", "Γ", perl=TRUE)
      [1] "Γ"

So no match.  The PCRE docs
https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
the old docs, but it works for our purposes here) mention we can turn on
unicode property matching with the "(*UCP)" token:

       gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
       [1] "a"

So there are two layers at play here.  The first one is whether R
converts strings to UTF-8, which I think is what the documentation is
about.  The other is whether the PCRE engine is configured to recognize
Unicode properties, which, at least in both of our configurations, does
not appear to be the case here.
From the surrounding context, I think the docs are talking about more than just conversion to UTF-8. The full paragraph reads like this:
"If you are working in a single-byte locale (though not common since R 4.2) and have marked UTF-8 strings that are representable in that locale, convert them first as just one UTF-8 string will force all the matching to be done in Unicode, which attracts a penalty of around
3× for the default POSIX 1003.2 mode."
i.e. it says the presence of UTF-8 strings slows things down by a factor of 3, so it's faster to convert everything to the local encoding. If it was just conversion, I don't think that would be true.
But maybe "for the default POSIX 1003.2 mode" applies to the whole paragraph, not just to the penalty, so this is intentional.
Agreed, I don't think this whole issue is just about the conversion. What I'm trying to highlight is the distinction between what R does (converts input to Unicode - UTF-8 for PCRE[1], wchar_t for POSIX/TRE[2]), and what the regular expression engines then do (match that Unicode per their own semantics). This for the case of any UTF-8 in the input.
PCRE is behaving as documented[3]:
> By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, although this may be different for characters in the range 128-255 when locale-specific matching is happening. These escape sequences retain their original meanings from before Unicode support was available, mainly for efficiency reasons. If the PCRE2_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows...
So this doesn't seem like a bug to me.
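
The quoted default can be checked from R for classes other than \w. The following is only a sketch: the variable names are illustrative, the test characters are my own choice, and it assumes a PCRE2 build with Unicode support, as in the examples earlier in this thread. The strings use \u escapes so that R marks them as UTF-8 whatever the session locale is.

```r
# Illustrative inputs (not from the thread): both code points are above 127.
gamma  <- "\u0393"   # GREEK CAPITAL LETTER GAMMA
digit3 <- "\u0663"   # ARABIC-INDIC DIGIT THREE

# Default PCRE semantics: \W always matches, \d never matches, above 127.
default_W <- grepl("\\W", gamma,  perl = TRUE)
default_d <- grepl("\\d", digit3, perl = TRUE)

# With (*UCP) at the start of the pattern, Unicode properties decide instead.
ucp_W <- grepl("(*UCP)\\W", gamma,  perl = TRUE)
ucp_d <- grepl("(*UCP)\\d", digit3, perl = TRUE)

print(c(default_W = default_W, default_d = default_d,
        ucp_W = ucp_W, ucp_d = ucp_d))
```

If the build behaves as the PCRE documentation describes, the two default results should be TRUE and FALSE, and the two (*UCP) results should be the reverse.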

Does that mean that the following is incorrect?

> one UTF-8 string will force all the matching to be done in Unicode
It depends on how you want to interpret "done in". Less ambiguous could be:
> one UTF-8 string will force all strings to be converted to Unicode prior to matching.

I've added a note to ?regexp about enabling Unicode properties in patterns using (*UCP). I understand that it may be surprising to users these are not fully enabled by default (PCRE2_UCP not set), but then it is the default behavior of PCRE2 and most likely chosen for performance reasons (see [3]), and ?regexp refers to PCRE documentation.
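
As an illustration of what (*UCP) does for the example that started this thread, here is a sketch that prefixes the original pattern with it. The variable names are illustrative, the accented and Greek letters are written as \u escapes (precomposed forms assumed) so the strings are marked UTF-8 in any locale, and a PCRE2 build with Unicode support is assumed.

```r
# The four strings from the original report, with non-ASCII letters escaped.
strings <- c(
  "89 562",
  "John Smith",
  paste0("\u0393\u03b9\u03ac\u03bd\u03bd\u03b7\u03c2 ",
         "\u03a0\u03b1\u03c0\u03b1\u03b4\u03cc\u03c0\u03bf\u03c5\u03bb\u03bf\u03c2"),
  "Jean-Fran\u00e7ois Dupuis"
)

# Original pattern: drop word characters that do not start a word, and spaces.
regex     <- "\\B\\w+| +"
# Same pattern with Unicode properties enabled for \w and \B.
regex_ucp <- paste0("(*UCP)", regex)

initials_tre  <- gsub(regex,     "", strings)               # default engine
initials_pcre <- gsub(regex_ucp, "", strings, perl = TRUE)  # PCRE with (*UCP)

print(initials_tre)
print(initials_pcre)
```

With (*UCP) in place, the perl = TRUE result should agree with the default engine, because \w and \B then treat the Greek and accented letters as word characters.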

Re ?gsub, I think it is ok, the matching is in Unicode/UTF-8. Whether the Unicode property support is available or how to fully enable it is another matter, not discussed in this part of the documentation.


Best
Tomas


Best,

B

[1]: https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1385

[2]: https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1378

[3]: https://pcre.org/current/doc/html/pcre2pattern.html


Duncan Murdoch


Best,

B.



However, this thread on SO: https://stackoverflow.com/q/76749529 gives
some indication that this is not true for `perl = TRUE`. Specifically:

  > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
"Jean-François Dupuis")
  > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
  > regex <- "\\B\\w+| +"
  > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

  > gsub(regex, "", strings, perl = TRUE)
[1] "85"                  "JS" "ΓιάννηςΠαπαδόπουλος"
"J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer
when the regex option /u ("match with full Unicode") is specified, but
the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this
looks like a flag may have been missed in the `perl = TRUE` case.

Duncan Murdoch

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


