Thank you. I probably shouldn't have mentioned "sentences" because it may 
well be that the author adjusted punctuation and capitalisation when 
copying a string of words from his note file into the main file.

In the sample text file, I have therefore replaced all punctuation with a 
space, and then replaced consecutive spaces with one space. (There are lots 
of footnotes in the file, which I have deleted as well because they are not 
relevant for my search.) Also, my search was not case sensitive. I simply 
want to make sure that no erroneous and embarrassing repetition of the same 
passage of four lines or so occurs.
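
For anyone who wants to reproduce this clean-up outside BBEdit, here is a 
minimal sketch, assuming Python 3. The function name `normalize` is made up 
for illustration, treating underscores as punctuation is an assumption, and 
lower-casing stands in for the case-insensitive search. Footnote removal is 
not covered.

```python
import re

def normalize(text: str) -> str:
    """Lower-case, turn punctuation into spaces, collapse runs of spaces."""
    text = text.lower()                      # stands in for a case-insensitive search
    text = re.sub(r"[^\w\s]|_", " ", text)   # each punctuation character becomes a space
    text = re.sub(r" {2,}", " ", text)       # consecutive spaces become one space
    return text.strip()                      # drop a leading or trailing space

if __name__ == "__main__":
    print(normalize("Hello, World!  Again."))
```

Line breaks are left alone here, so paragraph boundaries survive the clean-up.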

So maybe regex isn't the right tool for this? If so, I need to stop right 
here since I have, regrettably, neither the knowledge nor the tools to work 
with other languages.

On Monday, April 25, 2022 at 6:41:21 PM UTC+2 Bruce Van Allen wrote:

> That search pattern would start at the beginning of the text, grab the 
> first 200 characters and search the rest of the text for that, starting 
> with the very next (201st) character, and trying every 200-char sequence 
> from there to the end. Then it would progress one character forward to 
> character 2, grab it and the next 199, and repeat that search. 
>
> Moving ahead one character at a time through the million characters until 
> it finds a match or until the text no longer has 200 characters left would 
> certainly take some processing time!
>
> Are the long strings always single sentences? If so, your pattern would be 
> slightly optimized if it didn’t accept the end-of-sentence character 
> (“period”, “full stop”, “dot”).
>
> (?s)([^.]{200}).*\1
>
> (Inside the character class brackets ‘.’ just means dot, not “any 
> character”.)
>
> Given that the 200 is an arbitrary parameter (that is, you’re not looking 
> for a string you already know to be exactly that length), the pattern 
> above does NOT require an end-of-sentence character.
>
> Assuming standard English/European writing practice, the sentences could 
> probably also be expected to start with an upper-case alpha character after 
> a whitespace character, so the pattern would be faster as:
>
> (?s)(\s[A-Z][^.]{200}).*\1
>
> But the above suggestions won’t help much if you’re searching for strings 
> with multiple sentences.
>
> HTH
>
> _bruce__van_allen__santa_cruz_ca_
> _831_429_1688_p_
> _831_332_3649_c_
>
> On Apr 25, 2022, at 8:42 AM, samar <[email protected]> wrote:
>
> Hi all
>
> While copyediting a text for a scholarly book (500+ pages when printed), I 
> noticed that the author wrote exactly the same long sentence (= an 
> identical string of 337 characters) once on page 23 and once on page 326. 
> No doubt this happened because the author copied and pasted some text from 
> his notes, unaware that he had already copied and pasted the same text 
> earlier. I thought it would be a good idea to find out whether this has 
> happened to the author more than one time in his 1,000,000-character book, 
> so that I can alert him (to give him a chance to omit the repetition).
>
> And so I turned to BBEdit. The text of the whole book is now in a txt 
> file. When I search for the sentence that in the Word document is on page 
> 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831. 
> What regular expression can I use to find other such repetitions?
>
> I tried using the following string:
>
> (?s)(.{200}).*?\1
>
> This is what I understand it to mean (roughly):
>
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters, 
> optionally followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured 
> string
>
> The string does what I need if I replace 200 with a shorter number, such 
> as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of 
> course). Given that the sentence I have in mind is more than 300 characters 
> long I should even have been able to use 300 instead of just 200.
>
> Unfortunately, however, something seems to be amiss: BBEdit kept 
> searching without finding anything, and my notebook's fan started 
> running. After about 20 minutes it became clear that nothing would 
> happen and that I could do nothing but Force Quit BBEdit.
>
> So my question is, what's wrong with the above string? How else can I find 
> a repeated 200-character sentence in a large text file?
>
> Thanks
> Sam
>
> -- 
>
> This is the BBEdit Talk public discussion group. If you have a feature 
> request or need technical support, please email "[email protected]" 
> rather than posting here. Follow @bbedit on Twitter: <
> https://twitter.com/bbedit>
> --- 
> You received this message because you are subscribed to the Google Groups 
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com
>  
> <https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
>

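For readers with a scripting language at hand, the same check can be done 
without a backreference. What follows is a minimal sketch, assuming Python 3 
and a text that has already been stripped of punctuation. It is not a BBEdit 
feature; the function name `find_repeats` and the 30-word window (a rough 
stand-in for the 200 characters discussed in this thread) are assumptions made 
for illustration. Each window of words is recorded once in a dictionary, so 
the text is scanned a single time instead of being re-searched from every 
starting character.

```python
from collections import defaultdict

def find_repeats(text: str, window: int = 30) -> dict:
    """Map every `window`-word passage that occurs more than once to its word offsets."""
    words = text.lower().split()             # case-insensitive, whitespace-separated words
    seen = defaultdict(list)
    for i in range(len(words) - window + 1):
        passage = " ".join(words[i:i + window])
        seen[passage].append(i)              # remember where this window starts
    # keep only the passages that were seen at more than one offset
    return {p: offsets for p, offsets in seen.items() if len(offsets) > 1}

if __name__ == "__main__":
    sample = "a b c d e f a b c d x"
    for passage, offsets in find_repeats(sample, window=4).items():
        print(offsets, passage)
```

A repeated passage longer than the window is reported as several overlapping 
windows, so neighbouring offsets usually belong to one and the same repetition.
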