This sounds like a case of the [Longest repeated substring
problem](https://en.wikipedia.org/wiki/Longest_repeated_substring_problem).
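Regular expressions are not the right tool for the job, unfortunately.
Your pattern makes the engine capture 200 characters at nearly every
position in the file and then scan the rest of the text for a copy, which
for a 1,000,000-character file adds up to something on the order of
hundreds of billions of match attempts.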
There’s an [online
demo](https://daniel-hug.github.io/longest-repeated-substring/) that
might do what you need.
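If you would rather run something locally, a short script can look for
exact repeats of a fixed minimum length without any regex. What follows
is only a sketch under my own assumptions, not the method the demo above
uses: the function name `find_repeated_windows` and its `window`
parameter are made up for illustration, and 200 is simply the length
from your pattern. It remembers every 200-character window it has seen
and, when a window shows up a second time, extends the match as far as
it goes.

```python
def find_repeated_windows(text, window=200):
    """Report runs of at least `window` characters that already occurred earlier.

    Returns a list of (first_start, repeat_start, length) tuples.
    Hypothetical helper for illustration; offsets are 0-based character
    positions in `text`.
    """
    first_seen = {}   # window contents -> earliest start offset
    repeats = []
    last_start = len(text) - window
    i = 0
    while i <= last_start:
        chunk = text[i:i + window]
        j = first_seen.get(chunk)
        if j is None:
            # First time this window appears: remember where.
            first_seen[chunk] = i
            i += 1
            continue
        # Seen before at offset j: extend the match to the right.
        length = window
        while i + length < len(text) and text[j + length] == text[i + length]:
            length += 1
        repeats.append((j, i, length))
        # Resume at the first window that reaches past the end of this repeat.
        i += length - window + 1
    return repeats


# Example use (the file name is hypothetical):
# text = open("book.txt", encoding="utf-8").read()
# for first, again, length in find_repeated_windows(text):
#     print(first, again, length, repr(text[again:again + 60]))
```

The work grows with the length of the text times the window size rather
than with the square of the text length. The trade-off is memory: the
dictionary keeps one window-sized string per position, so for a book of
this size expect it to use a few hundred megabytes.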
If you want to find all long repeated substrings, you can take an
iterative approach: find the longest repeat, remove its duplicate
occurrences from the source text, and search again, stopping once the
longest remaining repeat is shorter than you care about.
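The iterative approach just described could be sketched roughly like
this. Everything here is an assumption for illustration: the names
`longest_repeat` and `find_all_long_repeats` are made up, the `min_len`
cut-off is mine, and the helper uses a deliberately naive sorted-suffix
search that is fine for a few thousand characters but far too
memory-hungry for a whole book (a suffix-array or suffix-tree based
finder would have to take its place there).

```python
def longest_repeat(text):
    """Longest substring that occurs at least twice (occurrences may overlap).

    Naive version: sorts full copies of every suffix, so it is only
    suitable for short inputs.
    """
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    # The longest repeat is the longest common prefix of two neighbouring
    # suffixes in sorted order.
    for a, b in zip(suffixes, suffixes[1:]):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best


def find_all_long_repeats(text, min_len=200):
    """Collect repeated substrings of at least `min_len` characters, longest first."""
    found = []
    while True:
        rep = longest_repeat(text)
        if not rep or len(rep) < min_len:
            return found
        found.append(rep)
        # Keep the first occurrence, drop every later non-overlapping copy.
        keep_until = text.find(rep) + len(rep)
        reduced = text[:keep_until] + text[keep_until:].replace(rep, "")
        if reduced == text:
            # The repeat only overlaps itself, so nothing could be removed;
            # stop rather than loop forever.
            return found
        text = reduced
```

One caveat with removing text: the characters on either side of a
removed copy end up next to each other, which could in principle create
a repeat that is not in the original, so it is worth checking each hit
against the source.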
Hope this helps,
-sam
On 25 Apr 2022, at 11:42, samar wrote:
Hi all
While copyediting a text for a scholarly book (500+ pages when printed),
I noticed that the author wrote exactly the same long sentence (an
identical string of 337 characters) once on page 23 and once on page 326.
No doubt this happened because the author copied and pasted some text
from his notes, unaware that he had already copied and pasted the same
text earlier. I thought it would be a good idea to find out whether this
has happened to the author more than once in his 1,000,000-character
book, so that I can alert him (and give him a chance to omit the
repetition).
And so I turned to BBEdit. The text of the whole book is now in a txt
file. When I search for the sentence that in the Word document is on
page 23, I can find it in BBEdit both in paragraph 117 and in paragraph
7831. What regular expression can I use to find other such repetitions?
I tried using the following string:
(?s)(.{200}).*?\1
This is what I understand it to mean (roughly):
(?s): search across paragraphs
(.{200}).*?: search for, and capture, a string of 200 characters, optionally followed by any characters
\1: stop the search as soon as you reach a second instance of the captured string
The string does what I need if I replace 200 with a smaller number, such
as 10 (but in this case BBEdit finds a lot of unproblematic repetitions,
of course). Given that the sentence I have in mind is more than 300
characters long, I should even have been able to use 300 instead of just
200. Unfortunately, however, something seems to be amiss: BBEdit kept on
searching and searching without finding anything, my notebook's fan
started running, and after about 20 minutes it became clear that nothing
would happen and that I could do nothing but Force Quit BBEdit.
So my question is, what's wrong with the above string? How else can I
find a repeated 200-character sentence in a large text file?
Thanks
Sam
--
This is the BBEdit Talk public discussion group. If you have a feature
request or need technical support, please email
"[email protected]" rather than posting here. Follow @bbedit on
Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google
Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com.