Massimo Miccoli wrote:
Sorry Andrzej,

I mean on DeleteDuplicates.java, not in runtime. Is that the correct place to integrate some like Shingling or n-gram?

Yes. But there is an small issue of high dimensionality to solve, otherwise it will be very inefficient...

Both shingling and n-gram based methods (word n-gram or character n-gram) produce a profile of a document, which can be compared to other profiles, one by one. So, this seems to be appropriate to detect near duplicates - you create a profile for each document (in IndexDoc), and sort them... but here's where the problems start.

Usually such profiles take a lot of space (e.g. a list of 100 top n-grams), and comparing them takes a lot of resources - and several comparison operations are needed per item to sort the signatures. This is currently done by HashScore.

(BTW, HashScore is missing the fetchTime, which the original dedup algorithm took also into account when comparing pages...).

So, you need to reduce the number of dimensions in a signature to decrease the complexity of compare operations. This can be done using purely numeric signatures (e.g. Nilsimsa - but this particular approach brings numerous problems with quantization noise).

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

  • Re: OPIC Massimo Miccoli
    • Re: OPIC Andrzej Bialecki

Reply via email to