On 7/25/23 03:13, Brodie Gaslam via R-devel wrote:


On 7/24/23 4:10 AM, Duncan Murdoch wrote:
On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:


On 7/23/23 4:29 PM, Duncan Murdoch wrote:
The help page for `?gsub` says (in the context of performance
considerations):


"... just one UTF-8 string will force all the matching to be done in
Unicode"

It's been a little while since I looked at the code but IIRC this just
means that strings are converted to UTF-8 before matching. The problem
here seems to be more about the interpretation of the "\\w+" token by
PCRE.  I think this makes it a little clearer what's going on:

      gsub("\\w", "a", "Γ", perl=TRUE)
      [1] "Γ"

So no match.  The PCRE docs
https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
the old docs, but it works for our purposes here) mention we can turn on
unicode property matching with the "(*UCP)" token:

       gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
       [1] "a"

So there are two layers at play here.  The first one is whether R
converts strings to UTF-8, which I think is what the documentation is
about.  The other is whether the PCRE engine is configured to recognize
Unicode properties, which, at least in both of our configurations and
for this specific case, it appears not to be.

From the surrounding context, I think the docs are talking about more than just conversion to UTF-8.  The full paragraph reads like this:

"If you are working in a single-byte locale (though not common since R 4.2) and have marked UTF-8 strings that are representable in that locale, convert them first as just one UTF-8 string will force all the matching to be done in Unicode, which attracts a penalty of around
3× for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a factor of 3, so it's faster to convert everything to the local encoding.  If it was just conversion, I don't think that would be true.

But maybe "for the default POSIX 1003.2 mode" applies to the whole paragraph, not just to the penalty, so this is intentional.

Agreed, I don't think this whole issue is just about the conversion. What I'm trying to highlight is the distinction between what R does (converts input to Unicode - UTF-8 for PCRE[1], wchar_t for POSIX/TRE[2]), and what the regular expression engines then do (match that Unicode per their own semantics).  This applies whenever any of the input is UTF-8.
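To make the first layer concrete, here is a minimal sketch of how one marked string affects a whole vector. It only inspects the encoding marks from R; it is not the actual logic in grep.c, and the name `needs_unicode` is hypothetical.

```r
# Non-ASCII input is written with \u escapes so it is marked UTF-8
# regardless of the session locale.
x <- c("John Smith", "\u0393\u03b9\u03ac\u03bd\u03bd\u03b7\u03c2")

# ASCII-only strings are never marked, so they report "unknown".
marks <- Encoding(x)
print(marks)

# Simplified stand-in for the check described above: a single UTF-8
# element is enough to send every element down the Unicode path.
needs_unicode <- any(marks == "UTF-8")
print(needs_unicode)
```

What the engine then does with the converted input (the second layer) is a separate question.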

PCRE is behaving as documented[3]:

> By default, characters whose code points are greater than 127 never match \d, \s, or \w, and always match \D, \S, and \W, although this may be different for characters in the range 128-255 when locale-specific matching is happening. These escape sequences retain their original meanings from before Unicode support was available, mainly for efficiency reasons. If the PCRE2_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows...

So this doesn't seem like a bug to me.
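A small sketch of the documented behaviour, assuming an R build whose PCRE has Unicode property support. The default results are printed rather than asserted, since they are what this thread reports, not something guaranteed across builds.

```r
# Test characters, written with \u escapes so they are marked UTF-8.
gamma <- "\u0393"   # GREEK CAPITAL LETTER GAMMA
digit <- "\u0663"   # ARABIC-INDIC DIGIT THREE

# Default PCRE semantics: per the documentation quoted above, code points
# greater than 127 are not expected to match \w or \d.
default_w <- grepl("\\w", gamma, perl = TRUE)
default_d <- grepl("\\d", digit, perl = TRUE)

# With (*UCP) at the start of the pattern, Unicode properties are used
# to determine character types.
ucp_w <- grepl("(*UCP)\\w", gamma, perl = TRUE)
ucp_d <- grepl("(*UCP)\\d", digit, perl = TRUE)

print(c(default_w = default_w, ucp_w = ucp_w,
        default_d = default_d, ucp_d = ucp_d))
```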

Does that mean that the following is incorrect?

> one UTF-8 string will force all the matching to be done in Unicode

It depends on how you want to interpret "done in".  Less ambiguous could be:

> one UTF-8 string will force all strings to be converted to Unicode prior to matching.

I've added a note to ?regexp about enabling Unicode properties in patterns using (*UCP). I understand that it may be surprising to users that these are not fully enabled by default (PCRE2_UCP not set), but then it is the default behavior of PCRE2 and most likely chosen for performance reasons (see [3]), and ?regexp refers to PCRE documentation.

Re ?gsub, I think it is ok, the matching is in Unicode/UTF-8. Whether the Unicode property support is available or how to fully enable it is another matter, not discussed in this part of the documentation.
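For reference, a sketch of applying (*UCP) to the pattern from the original report. With Unicode properties enabled, `perl = TRUE` is expected to give the same result as the default engine did there. The variable names `regex_ucp` and `initials` are illustrative only, and the non-ASCII letters are assumed to be precomposed.

```r
# The strings from the original report, written with \u escapes so the
# non-ASCII ones are marked UTF-8 regardless of the session locale.
strings <- c(
  "89 562",
  "John Smith",
  paste("\u0393\u03b9\u03ac\u03bd\u03bd\u03b7\u03c2",
        "\u03a0\u03b1\u03c0\u03b1\u03b4\u03cc\u03c0\u03bf\u03c5\u03bb\u03bf\u03c2"),
  "Jean-Fran\u00e7ois Dupuis"
)

# (*UCP) has to come at the very start of the pattern; it then applies to
# the whole pattern, including \B, which is defined in terms of \w.
regex_ucp <- "(*UCP)\\B\\w+| +"

initials <- gsub(regex_ucp, "", strings, perl = TRUE)
print(initials)
```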

Best
Tomas


Best,

B

[1]: https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1385
[2]: https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1378
[3]: https://pcre.org/current/doc/html/pcre2pattern.html


Duncan Murdoch

Best,

B.




However, this thread on SO: https://stackoverflow.com/q/76749529 gives
some indication that this is not true for `perl = TRUE`. Specifically:

  > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", "Jean-François Dupuis")
  > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
  > regex <- "\\B\\w+| +"
  > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

  > gsub(regex, "", strings, perl = TRUE)
[1] "85" "JS" "ΓιάννηςΠαπαδόπουλος" "J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer
when the regex option /u ("match with full Unicode") is specified, but
the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this
looks like a flag may have been missed in the `perl = TRUE` case.

Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

