Yes, that's the way the original '\b(\w+)+\1\b' works in practice. The leading zero or more word characters aren't captured, but they are part of the match, so they get deleted when the match is replaced with just the \1 capture group. Since \w* and (?:\w+)* match the same strings in practice, perhaps the expression '\b(?:\w+)*(\w+)\1\b' better explains how only the last iteration of (\w+) in the (\w+)+\1 expression is captured, and how all of the preceding \w+ iterations, if any, are discarded as captures and aren't included in the one final capture group.
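
For anyone who wants to check the "only the last iteration is captured" behavior outside BBEdit, here is a minimal sketch. It assumes Python's re module, not BBEdit's own grep engine, so it illustrates the regex semantics only and says nothing about BBEdit internals or the error code. The sample words come from the matches reported in this thread; the helper name show and the constant names are made up for the example.

```python
import re

# The three find patterns discussed in this thread.
REPEATED_GROUP = r'\b(\w+)+\1\b'         # original pattern, with the second +
NONCAPTURING = r'\b(?:\w+)*(\w+)\1\b'    # equivalent with the discarded part made explicit
SINGLE_GROUP = r'\b(\w+)\1\b'            # pattern without the second +


def show(pattern, text):
    """Return (whole match, capture group 1) for the first match, or None."""
    m = re.search(pattern, text)
    return (m.group(0), m.group(1)) if m else None


# Short words only: (\w+)+ can backtrack heavily on long words that do not match.
for word in ["facilisis", "Underhill", "Afterall", "11"]:
    print(word,
          show(REPEATED_GROUP, word),
          show(NONCAPTURING, word),
          show(SINGLE_GROUP, word))

# Replacing the whole match with \1 keeps only the last captured iteration.
replaced = re.sub(REPEATED_GROUP, r'\1', "facilisis Underhill 11")
print(replaced)
```

Under that assumption, the first two patterns report the same whole match and the same group 1 for each word (for example facilisis with group 1 'is'), while the pattern without the second + only matches words that are entirely one string repeated, such as 11.
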
Take for example the word facilisis. The regular expression engine ends up finding a leading, non-captured match on - facil -, a capture group 1 match on - is -, and a non-capturing backreference (\1) match of capture group 1 on the final - is -. The whole matched word then gets replaced by just the capture group 1 string 'is' (without the quotes). Your guess is as good as mine as to how much string slicing and dicing, capturing, and capture discarding the engine performs before arriving at that match and capture group solution.

That said, I flubbed the copy and paste in the last part of that comment discussing backtracking. I intended to use '\b(\w+)+\1\b' for the backtracking comment but instead copied and pasted '\b\w*(\w+)\1\b'. As it turns out, both do a whole lot of backtracking, but '\b\w*(\w+)\1\b' has slightly less backtracking than '\b(\w+)+\1\b' on the example search text I was using.

On Sunday, December 15, 2024 at 4:14:09 PM UTC-8 Bruce Van Allen wrote:

> Thanks for digging into the regex meaning of that second ‘+’ in '\b(\w+)+\b'.
>
> As it turned out, the OP needed to find repeated words, not characters, so inserting a spacebar space for the second plus sign totally works for them.
>
> Also, I’m not sure you’re suggesting this but at the end of your comment you’re talking about the pattern '\b\w*(\w+)\1\b'. That first zero or more word characters - \w* - won’t be captured and so won’t be in the replacement pattern. Is that what you meant?
>
> Best,
>
> - Bruce
>
> _bruce__van_allen__santa_cruz_ca_
>
> > On Dec 15, 2024, at 3:50 PM, GP <[email protected]> wrote:
> >
> > First with BBEdit 15.1.3 (15B62, Apple Silicon) I didn't get any error with ce gm's grep find and replace.
> >
> > That said, however, I found the second + is doing something in the find and replace operation.
> >
> > Using Howard's posted sample records test from the "Sorting multiple records in a text file" for testing text.
> > Using the Pattern Playground with the find: '\b(\w+)+\1\b' (without the quotes) and replace: \1 pattern, 7 matches were found:
> >
> > 0 -> facilisis
> > 1 -> is
> > replacement -> is
> >
> > 0 -> Underhill
> > 1 -> l
> > replacement -> l
> >
> > 0 -> 11
> > 1 -> 1
> > replacement -> 1
> >
> > 0 -> Afterall
> > 1 -> l
> > replacement -> l
> >
> > 0 -> 11
> > 1 -> 1
> > replacement -> 1
> >
> > 0 -> 22
> > 1 -> 2
> > replacement -> 2
> >
> > 0 -> Afterall
> > 1 -> l
> > replacement -> l
> >
> > whereas, with the find: '\b(\w+)\1\b' (without the second + and without the quotes) and same replace pattern, only 3 matches were found:
> >
> > 0 -> 11
> > 1 -> 1
> > replacement -> 1
> >
> > 0 -> 11
> > 1 -> 1
> > replacement -> 1
> >
> > 0 -> 22
> > 1 -> 2
> > replacement -> 2
> >
> > According to https://regex101.com's explanation, the difference is due to the capturing group workings of the (\w+)+ part of the regular expression: "A repeated capturing group will only capture the last iteration." So, if I'm not mistaken, the workings of (\w+)+ are equivalent to \w*(\w+) and the equivalent find grep is \b\w*(\w+)\1\b. That would match any word string containing zero or more word characters followed by a capturing group of one or more word characters followed by a single repeat of the captured group of characters. According to regex101.com's Regex Debugger there's a whole lot of backtracking going on to find all the matches with the \b\w*(\w+)\1\b grep.
> >
> > On Saturday, December 14, 2024 at 3:07:35 PM UTC-8 Bruce Van Allen wrote:
> >
> > Hi,
> >
> > An example of the text and a description of what you’re trying to accomplish would help.
> >
> > From your find pattern, I’m guessing you’re trying to find cases where a string is followed by the same string, to be replaced by just one instance of the string.
> >
> > '\b(\w+)+\1\b' (your original - without the quotes)
> >
> > Your find pattern’s second plus sign ‘+’ isn’t doing anything, because the first one, which quantifies the ‘\w’, is grabbing every consecutive word/alphanumeric character including any repetitions.
> >
> > Removing that second ‘+’, the find pattern '\b(\w+)\1\b' (without the quotes) will find a string of word characters followed immediately by the same string, as in ‘My sentence is abcabc for defdef.’ Using your replacement pattern of ‘\1’, this will become ‘My sentence is abc for def.’
> >
> > Guessing that you’re actually looking for duplicated WORDS, if the find pattern has a spacebar space ‘ ’ then it will find any word followed by a space and then the same exact word, and the replacement will eliminate the duplication.
> >
> > With find pattern '\b(\w+) \1\b', your replacement pattern makes 'My sentence is abc abc for def def.' into 'My sentence is abc for def.'
> >
> > If you want to find a string of word characters that matches an earlier instance of the same string but separated by more than just a space, your pattern may be more complicated.
> >
> > HTH and please clarify if my guesses are wrong.
> >
> > - Bruce
> >
> > _bruce__van_allen__santa_cruz_ca_
> >
> > > On Dec 14, 2024, at 1:43 PM, ce gm <[email protected]> wrote:
> > >
> > > Hello there,
> > >
> > > I am doing a GREP search on a .txt file in BBEdit on my Mac. Here are the find/replace terms:
> > > Find: \b(\w+)+\1\b
> > > Replace: \1
> > >
> > > When I input the Find term, it correctly identifies the targets in the preview (highlights them in yellow). Then, when I push Replace All, I get a pop up with Application Error Code: 12247 and nothing else.
> > >
> > > Anyone know what this means? A cursory Google search was not helpful.
> > >
> > > Thanks!
> > >
> > > --
> > > This is the BBEdit Talk public discussion group. If you have a feature request or believe that the application isn't working correctly, please email "[email protected]" rather than posting here. Follow @bbedit on Mastodon: <https://mastodon.social/@bbedit>
> > > ---
> > > You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
> > > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> > > To view this discussion visit https://groups.google.com/d/msgid/bbedit/c9e18d6f-f5c4-467e-9c01-fa4ffbaa5485n%40googlegroups.com.
