Here is a BBEdit Text Filter that scans the frontmost document's selected text (or the whole document if there is no selection) for the longest repeated substring and logs a regular expression to the 'Unix Script Output.log' that lets you find the repetition.
The filter works by replacing all characters that are neither alphanumeric nor underscores with a regular expression, so that only the 'text' is compared and everything else is ignored: whitespace, punctuation, math operators, etc. As a consequence, it's more useful on textual content than on code.

It is derived from the Go version of Ukkonen’s suffix tree construction <https://rosettacode.org/wiki/Ukkonen%E2%80%99s_suffix_tree_construction> and uses the gorun utility <https://github.com/erning/gorun>, which lets you execute a Go source file as a shell script.

1. To install Go
   • from Go's installer: https://go.dev/doc/install
   • or with Homebrew at the terminal:
     % brew install go
2. To install gorun at the terminal:
   % go install github.com/erning/gorun@latest
3. Copy the file find_longest_repeated_substring.go <https://gist.github.com/mixio/8258238888164c54be3f4d32d3ff2dc2> to ~/Library/Application Support/BBEdit/Text Filters
4. Use the text filter from the *Text menu > Apply Text Filter > find_longest_repeated_substring*
5. In case the 'Unix Script Output.log' doesn't show up, use the *Menu Go > Commands...* panel to find the 'Unix Script Output.log'
6. In the 'Unix Script Output.log', select the logged regular expression and copy it to the find window with *<Command-Shift-E>*.
   (Warning: Do not copy the trailing return character, as it is not part of the regular expression.)
7. Activate your document and use the find window to search for the occurrences of the repetition.
   (Warning: If the repetition is very long, the generated regular expression might not compile. In that case, select a shorter portion of it, from its beginning to some sensible length.)

HTH,

Jean Jourdain

On Tuesday, April 26, 2022 at 8:15:06 AM UTC+2 [email protected] wrote:

> The online demo you linked to, Sam, does exactly what I need! I've found a
> few other doublets in the file that clearly should not be there. Thank you
> all for your thoughts and inputs.
>
> On Monday, April 25, 2022 at 8:38:09 PM UTC+2 Sam Hathaway wrote:
>
>> This sounds like a case of the Longest repeated substring problem
>> <https://en.wikipedia.org/wiki/Longest_repeated_substring_problem>.
>> Regular expressions are not the right tool for the job, unfortunately.
>>
>> There’s an online demo
>> <https://daniel-hug.github.io/longest-repeated-substring/> that might do
>> what you need.
>>
>> If you want to find all long repeated substrings, you can take an
>> iterative approach: find the longest, remove the duplicates from the source
>> text, and again find the longest.
>>
>> Hope this helps,
>> -sam
>>
>> On 25 Apr 2022, at 11:42, samar wrote:
>>
>> Hi all
>>
>> While copyediting a text for a scholarly book (500+ pages when printed),
>> I noticed that the author wrote exactly the same long sentence (= an
>> identical string of 337 characters) once on page 23 and once on page 326.
>> No doubt this happened because the author copied and pasted some text from
>> his notes, unaware that he had already copied and pasted the same text
>> earlier. I thought it would be a good idea to find out whether this has
>> happened to the author more than one time in his 1,000,000-character book,
>> so that I can alert him (to give him a chance to omit the repetition).
>>
>> And so I turned to BBEdit. The text of the whole book is now in a txt
>> file. When I search for the sentence that in the Word document is on page
>> 23, I can find it in BBEdit both in paragraph 117 and in paragraph 7831.
>> What regular expression can I use to find other such repetitions?
>>
>> I tried using the following string:
>>
>> (?s)(.{200}).*?\1
>>
>> This is what I understand it to mean (roughly):
>>
>> (?s): search across paragraphs
>> (.{200}).*?: search for, and capture, a string of 200 characters,
>> optionally followed by any characters
>> \1: stop the search as soon as you reach a second instance of the
>> captured string
>>
>> The string does what I need if I replace 200 with a shorter number, such
>> as 10 (but in this case BBEdit finds a lot of unproblematic repetitions, of
>> course). Given that the sentence I have in mind is more than 300 characters
>> long I should even have been able to use 300 instead of just 200.
>>
>> Unfortunately, however, something seems to be amiss: BBEdit kept on
>> searching and searching, without finding anything, and my notebook started
>> fanning, and after about 20 minutes it became clear that nothing would
>> happen, and that I cannot do anything else but to Force Quit BBEdit.
>>
>> So my question is, what's wrong with the above string? How else can I
>> find a repeated 200-character sentence in a large text file?
>>
>> Thanks
>> Sam
>>
>> --
>> This is the BBEdit Talk public discussion group. If you have a feature
>> request or need technical support, please email "[email protected]"
>> rather than posting here. Follow @bbedit on Twitter:
>> <https://twitter.com/bbedit>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "BBEdit Talk" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/bbedit/b068a68d-28c7-44af-8994-7c3424ed0befn%40googlegroups.com.

--
To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/e2074d77-7f98-4d5b-9a9f-fece2a32533cn%40googlegroups.com.
