BTW, this is documented at http://pcre.org/current/doc/html/pcre2api.html#infoaboutpattern with a helpful example, copied below.
As a simple example of the name/number table, consider the following
pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED is
set, so white space - including newlines - is ignored):

  (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )

There are four named capture groups, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:

  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??

On Mon, Sep 4, 2023 at 3:02 AM Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>
> This Stackoverflow question https://stackoverflow.com/q/77036362 turned
> up a bug in the R PCRE interface.
>
> The example (currently in an edit to the original question) tried to use
> named capture with more than 127 named groups. Here's the code:
>
> append_unique_id <- function(x) {
>   for (i in seq_along(x)) {
>     x[i] <- paste0("<", paste(sample(letters, 10), collapse = ""), ">",
>                    x[i])
>   }
>   x
> }
>
> list_regexes <- sample(letters, 128, TRUE) # <<<<<<<<<<< change this to
>                                            # 127 and it works
> regex2 <- append_unique_id(list_regexes)
> regex2 <- paste0("(?", regex2, ")")
> regex2 <- paste(regex2, collapse = "|")
>
> out <- gregexpr(regex2, "Cyprus", perl = TRUE, ignore.case = TRUE)
> #> Error in gregexpr(regex2, "Cyprus", perl = TRUE, ignore.case = TRUE):
> attempt to set index -129/128 in SET_STRING_ELT
>
> I think the bug is in R, here:
> https://github.com/wch/r-source/blob/57d15d68235dd9bcfaa51fce83aaa71163a020e1/src/main/grep.c#L3079
>
> This is the line
>
>   int capture_num = (entry[0]<<8) + entry[1] - 1;
>
> where entry is declared as a pointer to a char. What this is doing is
> extracting a 16 bit number from the first two bytes of a character
> string holding the name of the capture group. Since char is a signed
> type, the conversion of bytes to integer gets messed up and the value
> comes out wrong.
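To make the table layout and the sign problem concrete, here is a minimal C sketch. It is not the code in grep.c; the function names are made up for illustration, the undefined (??) bytes from the documentation example are filled with 0, and the entry size is hard-coded to 8 (real code would ask pcre2_pattern_info() for PCRE2_INFO_NAMEENTRYSIZE). Whether plain char is signed depends on the platform, so the second decoder forces signed char to reproduce the behaviour described above.

```c
#include <stdio.h>
#include <string.h>

#define ENTRY_SIZE 8   /* entry size in the documentation example */

/* The four entries from the PCRE2 documentation example. Bytes shown
   as ?? in the docs are undefined; they are set to 0 here. */
static const unsigned char name_table[4][ENTRY_SIZE] = {
    {0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00},
    {0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00},
    {0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00},
    {0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00},
};

/* Group number: first two bytes, most significant first, read as
   unsigned so that bytes >= 0x80 are not sign-extended. */
int group_number(const unsigned char *entry)
{
    return (entry[0] << 8) | entry[1];
}

/* The name is the zero-terminated string after the two number bytes. */
const char *group_name(const unsigned char *entry)
{
    return (const char *)(entry + 2);
}

/* Same arithmetic, but with the bytes viewed as signed char, to mimic
   a platform where plain char is signed. */
int group_number_signed(const unsigned char *entry)
{
    signed char b[2];
    memcpy(b, entry, sizeof b);
    return (b[0] << 8) + b[1];
}

/* Hypothetical helper: print each name with its group number. */
void print_name_table(void)
{
    for (int i = 0; i < 4; i++)
        printf("%-5s -> group %d\n",
               group_name(name_table[i]), group_number(name_table[i]));
}
```

For an entry whose first two bytes are 00 80 (group 128), group_number() - 1 gives 127, while group_number_signed() - 1 gives -129, which is the index in the error message above.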
>
> Duncan Murdoch
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel