Massimo Miccoli wrote:
Sorry Andrzej,
I mean in DeleteDuplicates.java, not at runtime. Is that the correct
place to integrate something like shingling or n-grams?
Yes. But there is a small issue of high dimensionality to solve first,
otherwise it will be very inefficient...
Both shingling and n-gram based methods (word n-gram or character
n-gram) produce a profile of a document, which can be compared to other
profiles, one by one. So this seems appropriate for detecting near
duplicates - you create a profile for each document (in IndexDoc), and
sort them... but here's where the problems start.
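To make the profile idea concrete, here is a rough sketch (not Nutch's
actual code - class and method names are illustrative) of building a
character n-gram profile, i.e. the list of a document's most frequent
n-grams:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch: a document "profile" as its most frequent
// character n-grams. Real implementations usually also normalize
// punctuation and may weight n-grams by frequency.
public class NGramProfile {

    // Return the `top` most frequent character n-grams of `text`.
    public static List<String> topNGrams(String text, int n, int top) {
        String t = text.toLowerCase().replaceAll("\\s+", " ");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= t.length(); i++) {
            counts.merge(t.substring(i, i + n), 1, Integer::sum);
        }
        return counts.entrySet().stream()
            // Highest count first; ties broken alphabetically so the
            // profile is deterministic.
            .sorted((a, b) -> {
                int c = b.getValue() - a.getValue();
                return c != 0 ? c : a.getKey().compareTo(b.getKey());
            })
            .limit(top)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Even this small sketch shows the space problem: a profile of 100
n-grams per document is far larger than the single MD5 hash the current
dedup code stores.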
Usually such profiles take a lot of space (e.g. a list of the top 100
n-grams), and comparing them takes a lot of resources - and several
comparison operations are needed per item when sorting the signatures.
This is currently done by HashScore.
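The per-pair cost is easy to see in a sketch of such a comparison
(again illustrative, not the HashScore code) - e.g. Jaccard overlap of
two n-gram profiles, which costs O(|A| + |B|) per pair rather than a
constant-time hash equality check:

```java
import java.util.*;

// Illustrative sketch: compare two n-gram profiles by Jaccard
// similarity (|intersection| / |union|). Values near 1.0 suggest
// near-duplicates. Each comparison walks both sets, which is what
// makes sorting/deduping many documents this way expensive.
public class ProfileSimilarity {

    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }
}
```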
(BTW, HashScore is missing the fetchTime, which the original dedup
algorithm also took into account when comparing pages...).
So, you need to reduce the number of dimensions in a signature to
decrease the cost of the compare operations. This can be done using
purely numeric signatures (e.g. Nilsimsa - though that particular
approach brings numerous problems with quantization noise).
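To illustrate the dimensionality-reduction idea (this is a
simhash-style sketch, not Nilsimsa itself, which is more involved):
each n-gram's hash "votes" on 64 bit positions, so an arbitrarily
large profile collapses to one 64-bit value, and two signatures can
then be compared in constant time by Hamming distance:

```java
import java.util.*;

// Illustrative simhash-style numeric signature. Not Nilsimsa and not
// Nutch code; it only demonstrates reducing a large n-gram profile
// to a fixed-width number. Small Hamming distances between
// signatures indicate similar profiles, but the coarse bit
// quantization loses information - the "quantization noise" problem.
public class NumericSignature {

    // Collapse a collection of n-grams into a single 64-bit signature.
    public static long simhash(Collection<String> ngrams) {
        int[] votes = new int[64];
        for (String g : ngrams) {
            long h = hash64(g);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) sig |= 1L << i;
        }
        return sig;
    }

    // Constant-time comparison: number of differing bits.
    public static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // FNV-1a 64-bit string hash (simple and deterministic).
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

With signatures like these, sorting and comparing in HashScore becomes
cheap; the open question is how much near-duplicate precision the
quantization costs.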
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com