Massimo Miccoli wrote:
Hi Doug,
Many thanks for your patch; I'll try it now. I'm also thinking of integrating
an algorithm for near-duplicate URL detection, something like shingling.
Is dedup the best place to integrate such an algorithm?
That would be lovely. Dedup is the place to start, but certainly not the
place to stop... ;-)
I think we should introduce a separate "dedup" field for each page in
the DB. The reason is that if we re-use the md5 (or change its semantics
to mean "near duplicates covered by this value") then we run the risk of
losing a lot of legitimately unique URLs from the DB.
Shingling, if you know how to implement it efficiently, would certainly
be nice - but we could start by just passing a "normalized text" to md5.
By "normalized text" I mean all lowercase, stopwords removed,
punctuation removed, and any run of consecutive whitespace replaced with
exactly one space character. We could also use an n-gram profile (either
word-level or character-level) with coarse quantization.
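To make the idea concrete, here is a minimal sketch (in Python, for brevity - a real Nutch patch would be Java) of both approaches: hashing a normalized text with md5, and comparing word-level shingle sets. The stopword list and the shingle size k=3 are illustrative choices, not anything from Nutch itself.

```python
import hashlib
import re

# Tiny illustrative stopword list -- a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def normalize_text(text):
    # Lowercase, strip punctuation, drop stopwords, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

def dedup_signature(text):
    # md5 over the normalized text: near-identical pages get the same
    # signature, so they collide in the dedup step.
    return hashlib.md5(normalize_text(text).encode("utf-8")).hexdigest()

def shingles(text, k=3):
    # Word-level k-shingles over the normalized text.
    words = normalize_text(text).split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    # Jaccard similarity of the two shingle sets; 1.0 means the
    # normalized texts shingle identically.
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

With this, two pages differing only in case, punctuation, and spacing produce the same signature, while shingle resemblance gives a graded score for pages that differ more substantially.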
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com