On 24 January 2018 at 15:39, Erik Josefsson
<[email protected]> wrote:
> Hello,
>
> Meld is absolutely great, so I thought that I could maybe ask on this
> list if anyone here has seen a free implementation of a "longest common
> substring" algorithm?

This isn't quite longest common substring! It's longest repeated
substring, which is a slightly different problem. Looking around I
stumbled across
https://github.com/Daniel-Hug/longest-repeated-substring, which
seems... pretty okay? It's just a suffix tree implementation with a
front end.
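
To make the distinction concrete: longest common substring compares two
texts, while longest repeated substring looks for a repeat inside one
text. Here is a minimal Python sketch of the latter. It is my own
illustration, not code from the project above, and it uses sorted
suffixes rather than a suffix tree; the function name is made up.

```python
def longest_repeated_substring(text: str) -> str:
    """Return the longest substring that occurs at least twice in text.

    Occurrences are allowed to overlap. Returns "" if nothing repeats.
    """
    # Sort every suffix; any repeated substring is then a shared prefix
    # of two suffixes that end up next to each other.
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best


print(longest_repeated_substring("banana"))
```

This copies every suffix, so it is only sensible for short inputs; a
suffix tree or suffix array does the same job far more efficiently on
long documents.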

> I often find repeated phrases, or even snippets of texts, in policy
> documents, and I am looking for a quicker way to find them than myself.
>
> I think there are plagiarism-tools out there that can do this, but I'm
> looking for something smaller that can present the "sims" just as
> beautifully as the "diffs" in one single text.
>
> If such tool does not exist yet, can I put it on a wish-list for Meld?

While I can see the similarity, I'm not sure this is a good fit for
Meld. Meld is fairly focused on code and similar line-based comparison
use cases. There are many, many things that need to be done differently
when comparing natural language, and we don't do any of that.

On the upside, I think it would definitely be possible to cobble
something simple together with the above as a starting point. You'd
just need... a bit of pre-processing (normalise case, whitespace,
etc.), and probably some smarts to pick thresholds on the output side.
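
As a rough sketch of that pre-processing plus threshold idea, assuming
plain-text input: the snippet below lowercases the text, drops
punctuation and extra whitespace, then reports every run of min_words
consecutive words that shows up more than once. The names normalise,
repeated_phrases and min_words are invented for illustration, and it
works on word n-grams rather than a suffix tree.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def repeated_phrases(text, min_words=5):
    """Map each repeated phrase of min_words words to its word offsets."""
    words = normalise(text)
    positions = defaultdict(list)
    # Record where every window of min_words consecutive words starts.
    for i in range(len(words) - min_words + 1):
        positions[tuple(words[i:i + min_words])].append(i)
    # Keep only the phrases seen at least twice.
    return {" ".join(k): v for k, v in positions.items() if len(v) > 1}


sample = "The Commission shall report. the  commission SHALL report again"
for phrase, where in repeated_phrases(sample, min_words=3).items():
    print(phrase, where)
```

Raising min_words cuts the noise from common short phrases; merging
overlapping hits into one longer match would be the obvious next step.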

cheers,
Kai
