It seems like a bug to me. Using perl = TRUE, I see the desired result:

```
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
cat(regmatches(x, regexpr(pattern2, x, perl = TRUE)))
```

If you change it to something like:

```
x <- c(
    "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n",
    "\n```html\nblah blah \n```\n"
)
pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
print(regmatches(x, regexpr(pattern2, x)), width = 10)
```

you can see that it does find the match, so the combination of *? and \\1
must be messing up regexpr(). They seem to work perfectly fine on their own.

On Wed, Jan 25, 2023 at 7:57 PM Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>
> Thanks for pointing out my mistake. I oversimplified the real problem.
>
> I'll try to post a version of it that comes closer: Suppose I have a
> string like this:
>
>   x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
>
> If I cat() it, I see that it is really markdown source:
>
>   ```html
>   blah blah
>   ```
>
>   ```r
>   blah blah
>   ```
>
> I want to find the part that includes the html block, but not the r
> block. So I want to match "```html", followed by a minimal number of
> characters, then "```". Then this pattern works:
>
>   pattern <- "\n```html\n.*?\n```\n"
>
> and we get the right answer:
>
>   cat(regmatches(x, regexpr(pattern, x)))
>
>   ```html
>   blah blah
>   ```
>
> Okay, but this flavour of markdown says there can be more backticks, not
> just 3. So the block might look like
>
>   ````html
>   blah blah
>   ````
>
> I need to have the same number of backticks in the opening and closing
> marker. So I make the pattern more complicated, and it doesn't work:
>
>   pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
>
> This matches all of x:
>
>   > pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
>   > cat(regmatches(x, regexpr(pattern2, x)))
>
>   ```html
>   blah blah
>   ```
>
>   ```r
>   blah blah
>   ```
>
> Is that a bug, or am I making a silly mistake again?
>
> Duncan Murdoch
>
>
> On 25/01/2023 7:34 p.m., Andrew Simmons wrote:
> > grep(value = TRUE) just returns the strings which match the pattern.
> > You have to use regexpr() or gregexpr() if you want to know where the
> > matches are:
> >
> > ```
> > x <- "abaca"
> >
> > # extract only the first match with regexpr()
> > m <- regexpr("a.*?a", x)
> > regmatches(x, m)
> >
> > # or
> >
> > # extract every match with gregexpr()
> > m <- gregexpr("a.*?a", x)
> > regmatches(x, m)
> > ```
> >
> > You could also use sub() to remove the rest of the string:
> > `sub("^.*(a.*?a).*$", "\\1", x)`
> > keeping only the match within the parenthesis.
> >
> >
> > On Wed, Jan 25, 2023, 19:19 Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> >
> >     The docs for ?regexp say this: "By default repetition is greedy, so
> >     the maximal possible number of repeats is used. This can be changed
> >     to ‘minimal’ by appending ? to the quantifier. (There are further
> >     quantifiers that allow approximate matching: see the TRE
> >     documentation.)"
> >
> >     I want the minimal match, but I don't seem to be getting it. For
> >     example,
> >
> >     x <- "abaca"
> >     grep("a.*?a", x, value = TRUE)
> >     #> [1] "abaca"
> >
> >     Shouldn't I have gotten "aba", which is the first match to "a.*a"? If
> >     not, what would be the regexp that would give me the first match to
> >     "a.*a", without greedy expansion of the .*?
> >
> >     Duncan Murdoch
> >
> >     ______________________________________________
> >     R-help@r-project.org mailing list --
> >     To UNSUBSCRIBE and more, see
> >     https://stat.ethz.ch/mailman/listinfo/r-help
> >     PLEASE do read the posting guide
> >     http://www.R-project.org/posting-guide.html
> >     and provide commented, minimal, self-contained, reproducible code.
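
For anyone who wants to reproduce the comparison made at the top of this message, the following is a minimal sketch that runs the same pattern through both regex engines side by side. The strings x and pattern2 are taken from the thread; the names m_tre and m_pcre are only illustrative. The default-engine result is what the thread reports and may well differ between R versions, so the sketch only prints the comparison rather than assuming its outcome.

```r
# Sketch: same input and pattern as in the thread above.
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

# Default engine (TRE). The thread reports that this matches all of x;
# that behaviour may depend on the R version, so it is not asserted here.
m_tre <- regmatches(x, regexpr(pattern2, x))

# PCRE engine (perl = TRUE). The lazy quantifier stops at the first
# closing fence whose backticks equal the captured opening fence.
m_pcre <- regmatches(x, regexpr(pattern2, x, perl = TRUE))

cat(m_pcre)

# FALSE here would reproduce the discrepancy described in the thread.
print(identical(m_tre, m_pcre))
```

If the two results differ, passing perl = TRUE is the workaround suggested in the reply above.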