Massimo Miccoli wrote:
Sorry Andrzej,
I mean in DeleteDuplicates.java, not at runtime. Is that the correct
place to integrate something like shingling or n-grams?
Yes. But there is a small issue of high dimensionality to solve first,
otherwise it will be very inefficient...
Both shingling and n-gram based methods (word n-gram or character
n-gram) produce a profile of a document, which can be compared to other
profiles, one by one. So this seems appropriate for detecting near
duplicates - you create a profile for each document (in IndexDoc), and
sort them... but here's where the problems start.
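To make the profile idea concrete, here is a rough sketch (not Nutch's
actual code - class and method names are illustrative) of building a
character n-gram profile, i.e. the list of a document's most frequent
n-grams:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch: a document "profile" as its most frequent
// character n-grams. Real implementations usually also normalize
// punctuation and may weight n-grams by frequency.
public class NGramProfile {

    // Return the `top` most frequent character n-grams of `text`.
    public static List<String> topNGrams(String text, int n, int top) {
        String t = text.toLowerCase().replaceAll("\\s+", " ");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= t.length(); i++) {
            counts.merge(t.substring(i, i + n), 1, Integer::sum);
        }
        return counts.entrySet().stream()
            // Highest count first; ties broken alphabetically so the
            // profile is deterministic.
            .sorted((a, b) -> {
                int c = b.getValue() - a.getValue();
                return c != 0 ? c : a.getKey().compareTo(b.getKey());
            })
            .limit(top)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

Even this small sketch shows the space problem: a profile of 100
n-grams per document is far larger than the single MD5 hash the current
dedup code stores.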
Usually such profiles take a lot of space (e.g. a list of the top 100
n-grams), and comparing them takes a lot of resources - and several
comparison operations are needed per item when sorting the signatures.
This is currently done by HashScore.
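The per-pair cost is easy to see in a sketch of such a comparison
(again illustrative, not the HashScore code) - e.g. Jaccard overlap of
two n-gram profiles, which costs O(|A| + |B|) per pair rather than a
constant-time hash equality check:

```java
import java.util.*;

// Illustrative sketch: compare two n-gram profiles by Jaccard
// similarity (|intersection| / |union|). Values near 1.0 suggest
// near-duplicates. Each comparison walks both sets, which is what
// makes sorting/deduping many documents this way expensive.
public class ProfileSimilarity {

    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }
}
```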
(BTW, HashScore is missing the fetchTime, which the original dedup
algorithm also took into account when comparing pages...).
So, you need to reduce the number of dimensions in a signature to
decrease the cost of the compare operations. This can be done using
purely numeric signatures (e.g. Nilsimsa - though that particular
approach brings numerous problems with quantization noise).
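To illustrate the dimensionality-reduction idea (this is a
simhash-style sketch, not Nilsimsa itself, which is more involved):
each n-gram's hash "votes" on 64 bit positions, so an arbitrarily
large profile collapses to one 64-bit value, and two signatures can
then be compared in constant time by Hamming distance:

```java
import java.util.*;

// Illustrative simhash-style numeric signature. Not Nilsimsa and not
// Nutch code; it only demonstrates reducing a large n-gram profile
// to a fixed-width number. Small Hamming distances between
// signatures indicate similar profiles, but the coarse bit
// quantization loses information - the "quantization noise" problem.
public class NumericSignature {

    // Collapse a collection of n-grams into a single 64-bit signature.
    public static long simhash(Collection<String> ngrams) {
        int[] votes = new int[64];
        for (String g : ngrams) {
            long h = hash64(g);
            for (int i = 0; i < 64; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? 1 : -1;
            }
        }
        long sig = 0L;
        for (int i = 0; i < 64; i++) {
            if (votes[i] > 0) sig |= 1L << i;
        }
        return sig;
    }

    // Constant-time comparison: number of differing bits.
    public static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // FNV-1a 64-bit string hash (simple and deterministic).
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

With signatures like these, sorting and comparing in HashScore becomes
cheap; the open question is how much near-duplicate precision the
quantization costs.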
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com