The online demo you linked to, Sam, does exactly what I need! I've found a few other doublets in the file that clearly should not be there. Thank you all for your thoughts and input.
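
For anyone who finds this thread later and would rather run something locally than paste a whole book into a web page, here is a minimal Python sketch. It is not the algorithm the online demo uses, and nobody in this thread wrote it: the function name `find_repeats` and the `min_len` parameter are made up for illustration. As far as I can tell, the regex stalls because, for every starting position in the file, the engine has to scan the rest of the text for a second copy of the captured 200 characters, which comes to something on the order of 10^12 comparisons for a 1,000,000-character file. The sketch instead remembers a hash of each 200-character window and looks for a window it has already seen, so the text is walked through only once.

```python
def find_repeats(text, min_len=200):
    """Report passages of at least min_len characters that occur more than once.

    Returns a list of (first_start, second_start, length) tuples.
    Hypothetical helper for illustration, not part of BBEdit or the online demo.
    """
    seen = {}        # hash of a window -> start of the first window with that hash
    repeats = []
    i = 0
    last_start = len(text) - min_len
    while i <= last_start:
        window = text[i:i + min_len]
        key = hash(window)
        j = seen.get(key)
        # Compare the actual text so a hash collision cannot produce a false hit.
        if j is not None and text[j:j + min_len] == window:
            # Extend the match to the right for as long as both copies agree.
            length = min_len
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            repeats.append((j, i, length))
            # Skip the windows that lie entirely inside the match just reported.
            i += length - min_len + 1
        else:
            # Keep only the first position per hash; in the rare case of a
            # collision between different windows, a later repeat could be missed.
            seen.setdefault(key, i)
            i += 1
    return repeats
```

To use it, read the exported text file into a string (for example with `open("book.txt", encoding="utf-8").read()`, where `book.txt` is a placeholder name) and call `find_repeats(text, 200)`. Each result gives the character offsets of the two copies and the length of the shared passage, so `text[first_start:first_start + length]` shows what was repeated. A passage that occurs three or more times is reported once per later copy.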
On Monday, April 25, 2022 at 8:38:09 PM UTC+2 Sam Hathaway wrote:

> This sounds like a case of the Longest repeated substring problem
> <https://en.wikipedia.org/wiki/Longest_repeated_substring_problem>.
> Regular expressions are not the right tool for the job, unfortunately.
>
> There’s an online demo
> <https://daniel-hug.github.io/longest-repeated-substring/> that might do
> what you need.
>
> If you want to find all long repeated substrings, you can take an
> iterative approach: find the longest, remove the duplicates from the source
> text, and again find the longest.
>
> Hope this helps,
> -sam
>
> On 25 Apr 2022, at 11:42, samar wrote:
>
> Hi all
>
> While copyediting a text for a scholarly book (500+ pages when printed), I
> noticed that the author wrote exactly the same long sentence (= an
> identical string of 337 characters) once on page 23 and once on page 326.
> No doubt this happened because the author copied and pasted some text from
> his notes, unaware that he had already copied and pasted the same text
> earlier. I thought it would be a good idea to find out whether this has
> happened to the author more than one time in his 1,000,000-character book,
> so that I can alert him (to give him a chance to omit the repetition).
>
> And so I turned to BBEdit. The text of the whole book is now in a txt
> file. When I search for the sentence that in the Word document is on page
> 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831.
> What regular expression can I use to find other such repetitions?
>
> I tried using the following string:
>
> (?s)(.{200}).*?\1
>
> This is what I understand it to mean (roughly):
>
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters,
> optionally followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured
> string
>
> The string does what I need if I replace 200 with a shorter number, such
> as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of
> course). Given that the sentence I have in mind is more than 300 characters
> long I should even have been able to use 300 instead of just 200.
>
> Unfortunately, however, something seems to be amiss: BBEdit kept on
> searching and searching, without finding anything, and my notebook started
> fanning, and after about 20 minutes it became clear that nothing would
> happen, and that I cannot do anything else but to Force Quit BBEdit.
>
> So my question is, what's wrong with the above string? How else can I find
> a repeated 200-character sentence in a large text file?
>
> Thanks
> Sam
>
> --
> This is the BBEdit Talk public discussion group. If you have a feature
> request or need technical support, please email "[email protected]"
> rather than posting here. Follow @bbedit on Twitter:
> <https://twitter.com/bbedit>
> ---
> You received this message because you are subscribed to the Google Groups
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com.
