Samar, first off, this is a really cool challenge.

Personally I'd use JS as the tool of choice, if only to have a lot of control 
over the length of the string, reporting, real-time feedback and 
error-checking, fine tuning, handling multiple docs, and the ability to stop 
easily when the code errs.  (To be fair, JS is always my preferred go-to, so 
there's that.  People who can make this work with regex and regex alone always 
impress me.)  If you go with JS, let me know.
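
For what it's worth, here is a minimal sketch of what the JS route could look 
like. It is not tested against your book, and the names findRepeats and 
minLength are just placeholders I made up. My guess at why the regex hangs: 
for every starting position the engine captures 200 characters and then scans 
the rest of the file for the backreference, which is roughly quadratic work on 
1,000,000 characters. The sketch instead slides a fixed-length window over the 
text once, remembers where each window was first seen, and merges consecutive 
hits into one longer repeat.

```javascript
// Sketch only: findRepeats and minLength are hypothetical names,
// not part of BBEdit or any library.
// Reports stretches of at least minLength characters that occur again
// later in the text.
function findRepeats(text, minLength = 200) {
  const firstSeen = new Map(); // window text -> index of its first occurrence
  const repeats = [];
  let current = null; // the repeat currently being extended, if any

  for (let i = 0; i + minLength <= text.length; i++) {
    const windowText = text.slice(i, i + minLength);
    const first = firstSeen.get(windowText);

    if (first === undefined) {
      // Never seen before: remember it and close any running repeat.
      firstSeen.set(windowText, i);
      current = null;
      continue;
    }

    // How far the running repeat has advanced past its starting window.
    const offset = current ? current.length - minLength + 1 : 0;

    if (current && i === current.second + offset && first === current.first + offset) {
      // Both copies moved forward by one character: same repeat, one longer.
      current.length += 1;
    } else {
      // A new repeat: record where the earlier and the later copy start.
      current = { first, second: i, length: minLength };
      repeats.push(current);
    }
  }
  return repeats;
}
```

Each result holds the character offsets of the earlier copy (first) and the 
later copy (second) plus the length of the repeated stretch, so 
text.slice(r.second, r.second + r.length) gives you the text to search for in 
BBEdit. In Node you could load the file with fs.readFileSync(path, "utf8") 
and pass the string in. One caveat: the Map keeps one entry per distinct 
window, so about a million entries for your book, which should be fine on a 
current machine but is an assumption on my part.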

Thanks for sharing. Following.



> On Apr 25, 2022, at 08:42, samar <[email protected]> wrote:
> Hi all
> 
> While copyediting a text for a scholarly book (500+ pages when printed), I 
> noticed that the author wrote exactly the same long sentence (= an identical 
> string of 337 characters) once on page 23 and once on page 326. No doubt this 
> happened because the author copied and pasted some text from his notes, 
> unaware that he had already copied and pasted the same text earlier. I 
> thought it would be a good idea to find out whether this has happened to the 
> author more than one time in his 1,000,000-character book, so that I can 
> alert him (to give him a chance to omit the repetition).
> 
> And so I turned to BBEdit. The text of the whole book is now in a txt file. 
> When I search for the sentence that in the Word document is on page 23, I can 
> find it in BBEdit both in paragraph 117 and in paragraph 7831. What regular 
> expression can I use to find other such repetitions?
> 
> I tried using the following string:
> 
> (?s)(.{200}).*?\1
> 
> This is what I understand it to mean (roughly):
> 
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters, optionally 
> followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured 
> string
> 
> The string does what I need if I replace 200 with a shorter number, such as 
> 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of 
> course). Given that the sentence I have in mind is more than 300 characters 
> long I should even have been able to use 300 instead of just 200.
> 
> Unfortunately, however, something seems to be amiss: BBEdit kept on searching 
> and searching, without finding anything, and my notebook started fanning, and 
> after about 20 minutes it became clear that nothing would happen, and that I 
> cannot do anything else but to Force Quit BBEdit.
> 
> So my question is, what's wrong with the above string? How else can I find a 
> repeated 200-character sentence in a large text file?
> 
> Thanks
> Sam
> -- 
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or need technical support, please email "[email protected]" 
> rather than posting here. Follow @bbedit on Twitter: 
> <https://twitter.com/bbedit>
> --- 
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com.

To view this discussion on the web visit 
https://groups.google.com/d/msgid/bbedit/DEB0C960-89E1-43C1-81D3-ED12E8845E73%40gmail.com.
