Hi everyone, I've noticed new behavior in `regexpr(..., perl = TRUE)` on Windows with R 4.0 and above when the input contains Unicode characters. Here's a minimal example where I'd expect to see a start value of `5` (as R 3.6.2 and below gives), but R 4.0.0 (and R 4.0.1) now returns:
```
> regexpr("b", "foo\U0001F937bar", perl = TRUE)
#> [1] 6
#> attr(,"match.length")
#> [1] 1
```

Perhaps this change in behavior could be explained by R 4.0's migration to PCRE2? Here is some relevant output from my R 4.0 session:

```
> pcre_config()
#>              UTF-8 Unicode properties                JIT              stack
#>               TRUE               TRUE               TRUE              FALSE
```

```
> extSoftVersion()
#>                      zlib                     bzlib                        xz                      PCRE
#>                  "1.2.11"      "1.0.8, 13-Jul-2019"                   "5.2.4"        "10.33 2019-04-16"
#>                       ICU                       TRE                     iconv                  readline                      BLAS
#>                    "58.2" "TRE 0.8.0 R_fixes (BSD)"               "win_iconv"                        ""                        ""
```

Let me know if there's any more information I can provide to help replicate and isolate the issue. Also, if this happens to be the expected behavior, I'd be keen to learn why that's the case.

Thank you,
-Carson

--
Carson Sievert, PhD
Software Engineer at RStudio

Website <https://cpsievert.me> | Twitter <https://twitter.com/cpsievert> | GitHub <https://github.com/cpsievert>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel