The online demo you linked to, Sam, does exactly what I need! I've found a 
few other duplicated passages in the file that clearly should not be there. 
Thank you all for your thoughts and input.

On Monday, April 25, 2022 at 8:38:09 PM UTC+2 Sam Hathaway wrote:

> This sounds like a case of the Longest repeated substring problem 
> <https://en.wikipedia.org/wiki/Longest_repeated_substring_problem>. 
> Regular expressions are not the right tool for the job, unfortunately.
>
> There’s an online demo 
> <https://daniel-hug.github.io/longest-repeated-substring/> that might do 
> what you need.
>
> If you want to find all long repeated substrings, you can take an 
> iterative approach: find the longest, remove the duplicates from the source 
> text, and again find the longest.
>
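For anyone who would rather script the iterative approach than use the demo, 
here is a minimal Python sketch of it. It is not the demo's code. The function 
names, the `max_length` cap and the `min_length` threshold are my own 
assumptions. It binary-searches the length and remembers every substring of 
that length, which is fine for a chapter-sized sample but too memory-hungry 
for a whole book at large lengths; a suffix array or suffix tree is the usual 
choice there.

```python
def has_repeat(text, length):
    """Return a substring of `length` characters that occurs twice, or None."""
    seen = set()
    for i in range(len(text) - length + 1):
        piece = text[i:i + length]
        if piece in seen:
            return piece
        seen.add(piece)
    return None


def longest_repeated_substring(text, max_length=1000):
    """Binary-search the largest length for which some substring repeats.

    `max_length` is an assumed cap that keeps memory use bounded.
    Occurrences are allowed to overlap.
    """
    lo, hi, best = 1, min(len(text) - 1, max_length), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        found = has_repeat(text, mid)
        if found is not None:
            best, lo = found, mid + 1   # a repeat exists, try longer
        else:
            hi = mid - 1                # no repeat, try shorter
    return best


def all_long_repeats(text, min_length=200):
    """Find the longest repeat, delete its later copies, and search again."""
    repeats = []
    while True:
        passage = longest_repeated_substring(text)
        if len(passage) < min_length:
            return repeats
        repeats.append(passage)
        end_of_first = text.find(passage) + len(passage)
        shortened = text[:end_of_first] + text[end_of_first:].replace(passage, "")
        if shortened == text:           # only overlapping copies remain
            return repeats
        text = shortened
```

Deleting the later copies can join text that was previously apart, so the 
results of later rounds are worth checking by eye against the original file.
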
> Hope this helps,
> -sam
>
> On 25 Apr 2022, at 11:42, samar wrote:
>
> Hi all
>
> While copyediting a text for a scholarly book (500+ pages when printed), I 
> noticed that the author wrote exactly the same long sentence (= an 
> identical string of 337 characters) once on page 23 and once on page 326. 
> No doubt this happened because the author copied and pasted some text from 
> his notes, unaware that he had already copied and pasted the same text 
> earlier. I thought it would be a good idea to find out whether this has 
> happened to the author more than once in his 1,000,000-character book, 
> so that I can alert him (to give him a chance to omit the repetition).
>
> And so I turned to BBEdit. The text of the whole book is now in a txt 
> file. When I search for the sentence that in the Word document is on page 
> 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831. 
> What regular expression can I use to find other such repetitions?
>
> I tried using the following string:
>
> (?s)(.{200}).*?\1
>
> This is what I understand it to mean (roughly):
>
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters, 
> optionally followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured 
> string
>
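A small aside on what the pieces of the pattern do, since the reading above is 
only roughly right: (?s) lets the dot match line breaks, (.{200}) captures 
exactly 200 characters, .*? matches as few further characters as possible, and 
\1 is a backreference that must match the captured text again. The sketch 
below shows this on a tiny sample with 5 in place of 200. It uses Python's 
`re` module rather than BBEdit's own engine, which is an assumption on my 
part, although these particular constructs should behave the same in both. The 
sample text is made up.

```python
import re

# Hypothetical miniature of the book: "abcde" occurs on two different lines.
sample = "abcdeXYZ\nabcde tail"

# Same shape as the original pattern, with 5 in place of 200.
pattern = re.compile(r"(?s)(.{5}).*?\1")

match = pattern.search(sample)
repeated = match.group(1)   # the captured 5-character string
span = match.span()         # start and end of the whole match

# Without (?s) the dot cannot cross the line break, so nothing is found.
no_dotall = re.search(r"(.{5}).*?\1", sample)
```

Here `repeated` holds the captured text and `span` runs from the start of the 
first copy to the end of the second, so the match covers everything in between 
as well.
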
> The string does what I need if I replace 200 with a smaller number, such 
> as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of 
> course). Given that the sentence I have in mind is more than 300 characters 
> long, I should even have been able to use 300 instead of just 200.
>
> Unfortunately, something seems to be amiss: BBEdit kept searching without 
> finding anything, and my notebook's fan started running. After about 20 
> minutes it became clear that nothing would happen and that I had no choice 
> but to Force Quit BBEdit.
>
> So my question is, what's wrong with the above string? How else can I find 
> a repeated 200-character sentence in a large text file?
>
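As far as I can tell, nothing is wrong with the pattern as such. The problem 
is the amount of work it asks for. At every starting position a backtracking 
engine captures 200 characters and then tries the backreference at each later 
position in the file before giving up and moving one character along. For 
1,000,000 characters that is roughly n*n/2, so on the order of hundreds of 
billions of attempts. A single pass with a dictionary avoids this. The sketch 
below is one possible way to do it in Python, not a BBEdit feature. The 
function name and the merging of neighbouring hits are my own assumptions.

```python
def repeated_passages(text, size=200):
    """Return (first_start, later_start, length) for each repeated run.

    Every window of `size` characters is looked up in a dictionary of the
    windows seen so far; consecutive hits are merged into one longer run.
    """
    first_seen = {}   # hash of a window -> index where it first occurred
    runs = []         # each entry: [first_start, later_start, length]
    for i in range(len(text) - size + 1):
        window = text[i:i + size]
        j = first_seen.setdefault(hash(window), i)
        if j == i or text[j:j + size] != window:
            continue  # first occurrence, or merely a hash collision
        last = runs[-1] if runs else None
        if (last
                and j == last[0] + last[2] - size + 1
                and i == last[1] + last[2] - size + 1):
            last[2] += 1               # the repeat continues one character on
        else:
            runs.append([j, i, size])  # a new repeated passage starts here
    return [tuple(run) for run in runs]
```

Reading the exported txt file into a string and calling 
`repeated_passages(book_text, 200)` would then list each passage of 200 or 
more characters that also occurs earlier in the file, with the character 
offsets of both copies. `book_text` is a placeholder name.
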
> Thanks
> Sam
>
> --
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or need technical support, please email "[email protected]" 
> rather than posting here. Follow @bbedit on Twitter: 
> <https://twitter.com/bbedit>
> ---
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
>

