On 24 January 2018 at 15:39, Erik Josefsson <[email protected]> wrote:
> Hello,
>
> Meld is absolutely great, so I thought that I could maybe ask on this
> list if anyone here have seen a free implementation of a "longest common
> substring" algorithm?

This isn't quite longest common substring! It's longest repeated
substring, which is a slightly different problem: you're looking for
repeats within a single text, rather than text shared between two.

Looking around I stumbled across
https://github.com/Daniel-Hug/longest-repeated-substring, which seems...
pretty okay? It's just a suffix tree implementation with a front end.

> I often find repeated phrases, or even snippets of texts, in policy
> documents, and I am looking for a quicker way to find them than myself.
>
> I think there are plagiarism-tools out there that can do this, but I'm
> looking for something smaller that can present the "sims" just as
> beautifully as the "diffs" in one single text.
>
> If such tool does not exist yet, can I put it on a wish-list for Meld?

While I can see the similarity, I'm not sure this is a good fit for
Meld. Meld is fairly focused on code and similar line-based comparison
use cases. There are many, many things that need to be done differently
when comparing natural language, and we don't do any of that.

On the upside, I think it would definitely be possible to cobble
something simple together with the above as a starting point. You'd just
need... a bit of pre-processing (normalise case, whitespace, etc.), and
probably some smarts to pick thresholds on the output side.

cheers,
Kai
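
As a rough illustration of the pre-processing and threshold idea mentioned above, here is a minimal Python sketch. It is not the suffix tree approach from the linked repository; it assumes a much simpler index of word n-grams instead. The names `normalise`, `repeated_phrases`, `min_words` and `min_count` are made up for this example, and the sample sentence is invented.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case the text and split it into words.

    Punctuation and runs of whitespace are discarded, so "Hello,  World!"
    and "hello world" compare equal.
    """
    return re.findall(r"\w+", text.lower())


def repeated_phrases(text, min_words=4, min_count=2):
    """Map each repeated phrase to the word offsets where it starts.

    A phrase here is a run of exactly `min_words` consecutive words.
    Only phrases seen at least `min_count` times are returned; these two
    parameters are the (hypothetical) output-side thresholds.
    """
    words = normalise(text)
    positions = defaultdict(list)
    # Index every window of min_words consecutive words by its content.
    for i in range(len(words) - min_words + 1):
        positions[tuple(words[i:i + min_words])].append(i)
    return {
        " ".join(gram): offsets
        for gram, offsets in positions.items()
        if len(offsets) >= min_count
    }


sample = (
    "The Commission shall report to the Council every year. "
    "In addition, the Commission shall report to the Parliament."
)
for phrase, offsets in repeated_phrases(sample, min_words=4).items():
    print(phrase, offsets)
```

A longer repeat shows up as several overlapping fixed-length phrases; merging those into one maximal match is left out here, and is the part where a suffix tree or suffix array would do a better job.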
