Thank you. I probably shouldn't have mentioned "sentences" because it may
well be that the author adjusted punctuation and capitalisation when
copying a string of words from his note file into the main file.
In the sample text file, I have therefore replaced all punctuation with a
space, and then replaced consecutive spaces with one space. (There are lots
of footnotes in the file, which I have deleted as well because they are not
relevant for my search.) Also, my search was not case sensitive. I simply
want to make sure that no erroneous and embarrassing repetitions of the same
passage, four lines or so long, occur.
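
For anyone following along who has Python at hand, the clean-up described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function name `normalize` is made up, "punctuation" is taken to mean anything that is not a letter, digit, underscore or whitespace, and the footnote removal is not covered.

```python
import re

def normalize(text: str) -> str:
    """Lower-case the text, replace punctuation with spaces, collapse spaces."""
    text = text.lower()                    # the search is not case sensitive
    # Assumption: "punctuation" = any character that is neither a word
    # character (letter, digit, underscore) nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)   # each punctuation character becomes a space
    text = re.sub(r" +", " ", text)        # consecutive spaces become one space
    return text                            # line breaks are left untouched

if __name__ == "__main__":
    print(normalize("No doubt, this happened:  copy & paste!"))
```

Line breaks are deliberately kept, so paragraph numbers in the cleaned file still match the original.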
So maybe regex isn't the right tool for this? If so, I need to stop right
here since I have, regrettably, neither the knowledge nor the tools to work
with other languages.
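
In case regex really is the wrong tool, here is one possible non-regex approach, sketched in Python for whoever on the list can run it. It is a rough sketch and not a tested solution for this book: the name `find_repeats` and the default window of 30 words are my own assumptions. It walks through the text once, remembers where every run of `window` consecutive words starts, and reports the runs that start in more than one place.

```python
from collections import defaultdict

def find_repeats(text: str, window: int = 30) -> dict:
    """Map every run of `window` consecutive words that occurs more than once
    to the list of word positions at which it starts."""
    words = text.lower().split()           # case-insensitive, whitespace-separated
    seen = defaultdict(list)
    # Slide a window over the words; each run is looked up in a dictionary,
    # so the text is scanned once rather than once per starting character.
    for i in range(len(words) - window + 1):
        seen[" ".join(words[i:i + window])].append(i)
    return {run: starts for run, starts in seen.items() if len(starts) > 1}

if __name__ == "__main__":
    sample = "a b c d x a b c d y"
    for run, starts in find_repeats(sample, window=4).items():
        print(starts, run)
```

A repetition longer than the window shows up as several overlapping hits, one per word position, so the output needs to be read with that in mind. The positions are word counts, not page or paragraph numbers.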
On Monday, April 25, 2022 at 6:41:21 PM UTC+2 Bruce Van Allen wrote:
> That search pattern would start at the beginning of the text, grab the
> first 200 characters and search the rest of the text for that, starting
> with the very next (201st) character, and trying every 200-char sequence
> from there to the end. Then it would progress one character forward to
> character 2, grab it and the next 199, and repeat that search.
>
> Moving ahead one character at a time through the million characters until
> it finds a match or until the text no longer has 200 characters left would
> certainly take some processing time!
>
> Are the long strings always single sentences? If so, your pattern would be
> slightly optimized if it didn’t accept the end-of-sentence character
> (“period”, “full stop”, “dot”).
>
> (?s)([^.]{200}).*\1
>
> (Inside the character class brackets ‘.’ just means dot, not “any
> character”.)
>
> Given that the 200 is an arbitrary parameter (that is, you’re not looking
> for a string you already know, exactly that length), the above does NOT
> have an end of sentence character.
>
> Assuming standard English/European writing practice, the sentences could
> probably also be expected to start with an upper-case alpha character after
> a whitespace character, so the pattern would be faster as:
>
> (?s)(\s[A-Z][^.]{200}).*\1
>
> But the above suggestions won’t help much if you’re searching for strings
> with multiple sentences.
>
> HTH
>
> _bruce__van_allen__santa_cruz_ca_
> _831_429_1688_p_
> _831_332_3649_c_
>
> On Apr 25, 2022, at 8:42 AM, samar <[email protected]> wrote:
>
> Hi all
>
> While copyediting a text for a scholarly book (500+ pages when printed), I
> noticed that the author wrote exactly the same long sentence (= an
> identical string of 337 characters) once on page 23 and once on page 326.
> No doubt this happened because the author copied and pasted some text from
> his notes, unaware that he had already copied and pasted the same text
> earlier. I thought it would be a good idea to find out whether this has
> happened to the author more than one time in his 1,000,000-character book,
> so that I can alert him (to give him a chance to omit the repetition).
>
> And so I turned to BBEdit. The text of the whole book is now in a txt
> file. When I search for the sentence that in the Word document is on page
> 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831.
> What regular expression can I use to find other such repetitions?
>
> I tried using the following string:
>
> (?s)(.{200}).*?\1
>
> This is what I understand it to mean (roughly):
>
> (?s): search across paragraphs
> (.{200}).*?: search for, and capture, a string of 200 characters,
> optionally followed by any characters
> \1: stop the search as soon as you reach a second instance of the captured
> string
>
> The string does what I need if I replace 200 with a shorter number, such
> as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of
> course). Given that the sentence I have in mind is more than 300 characters
> long I should even have been able to use 300 instead of just 200.
>
> Unfortunately, however, something seems to be amiss: BBEdit kept on
> searching and searching without finding anything, my notebook's fan
> started running, and after about 20 minutes it became clear that nothing
> would happen and that I could do nothing but Force Quit BBEdit.
>
> So my question is, what's wrong with the above string? How else can I find
> a repeated 200-character sentence in a large text file?
>
> Thanks
> Sam
>
> --
>
> This is the BBEdit Talk public discussion group. If you have a feature
> request or need technical support, please email "[email protected]"
> rather than posting here. Follow @bbedit on Twitter:
> <https://twitter.com/bbedit>
> ---
> You received this message because you are subscribed to the Google Groups
> "BBEdit Talk" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com