Samar, first off, this is a really cool challenge. Personally I'd use JS as the tool of choice, if only to have a lot of control over the length of the string, reporting, real-time feedback and error checking, fine-tuning, handling multiple docs, and the ability to stop easily when the code errs. (To be fair, JS is always my preferred go-to, so there's that. People who can make this work with regex and regex alone always impress.) If you go JS, lmk.
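In case it helps, here is a rough sketch of what I mean by the JS route. It is not tested against your book, and the function name `findRepeats` and the `minLen` parameter are just names I made up for illustration. My guess at why the regex stalls: for every starting position it captures 200 characters and then scans the rest of the file for a second copy, so on a 1,000,000-character file that can mean on the order of a million scans of up to a million characters each. The sketch below swaps that for a single pass: it remembers every 200-character window it has seen in a Map and reports a hit the moment a window shows up a second time.

```javascript
// Find passages of at least `minLen` characters that occur more than once.
// Returns an array of { first, second, length, text } records, where
// `first` and `second` are character offsets into `text`.
function findRepeats(text, minLen = 200) {
  const seen = new Map(); // window text -> offset of its first occurrence
  const repeats = [];
  let i = 0;
  while (i + minLen <= text.length) {
    const win = text.slice(i, i + minLen);
    const first = seen.get(win);
    if (first === undefined) {
      // New window: remember where it started and move on one character.
      seen.set(win, i);
      i += 1;
      continue;
    }
    // Seen before: extend the match to the right while both copies agree.
    let len = minLen;
    while (i + len < text.length && text[first + len] === text[i + len]) {
      len += 1;
    }
    repeats.push({ first, second: i, length: len, text: text.slice(i, i + len) });
    // Skip the windows that lie wholly inside the match just reported.
    i += len - minLen + 1;
  }
  return repeats;
}
```

To try it on the book, something like `findRepeats(require('fs').readFileSync('book.txt', 'utf8'), 200)` in Node should do (the file name is hypothetical). The offsets are character positions, not paragraph numbers, so you would still search for the reported text in BBEdit to see where it sits. Lowering `minLen` finds shorter repeats at the cost of more noise, same as with the regex.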
Thanks for sharing. Following.

> On Apr 25, 2022, at 08:42, samar <[email protected]> wrote:
>
> Hi all
>
> While copyediting a text for a scholarly book (500+ pages when printed), I
> noticed that the author wrote exactly the same long sentence (= an identical
> string of 337 characters) once on page 23 and once on page 326. No doubt this
> happened because the author copied and pasted some text from his notes,
> unaware that he had already copied and pasted the same text earlier. I
> thought it would be a good idea to find out whether this has happened to the
> author more than one time in his 1,000,000-character book, so that I can
> alert him (to give him a chance to omit the repetition).
>
> And so I turned to BBEdit. The text of the whole book is now in a txt file.
> When I search for the sentence that in the Word document is on page 23, I can
> find it in BBEdit both in paragraph 117 and in paragraph 7831. What regular
> expression can I use to find other such repetitions?
>
> I tried using the following string:
>
> (?s)(.{200}).*?\1
>
> This is what I understand it to mean (roughly):
>
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters, optionally
> followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured
> string
>
> The string does what I need if I replace 200 with a shorter number, such as
> 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of
> course). Given that the sentence I have in mind is more than 300 characters
> long I should even have been able to use 300 instead of just 200.
>
> Unfortunately, however, something seems to be amiss: BBEdit kept on searching
> and searching, without finding anything, and my notebook started fanning, and
> after about 20 minutes it became clear that nothing would happen, and that I
> cannot do anything else but to Force Quit BBEdit.
>
> So my question is, what's wrong with the above string? How else can I find a
> repeated 200-character sentence in a large text file?
>
> Thanks
> Sam
>
> --
> This is the BBEdit Talk public discussion group. If you have a feature
> request or need technical support, please email "[email protected]"
> rather than posting here. Follow @bbedit on Twitter:
> <https://twitter.com/bbedit>
> ---
> You received this message because you are subscribed to the Google Groups
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com.
