Re: OPIC

Massimo Miccoli Fri, 21 Oct 2005 10:16:58 -0700

Sorry Andrzej,

I mean in DeleteDuplicates.java, not at runtime. Is that the correct place to integrate something like shingling or n-grams?


Massimo

Andrzej Bialecki wrote:

Massimo Miccoli wrote:
Hi Doug,
Many thanks for your patch. I'll try it now. I'm also thinking of integrating an algorithm for near-duplicate URL detection, something like shingling.
Is dedup the best place to integrate the algo?
That would be lovely. Dedup is the place to start, but certainly not the place to stop... ;-)
I think we should introduce a separate "dedup" field for each page in the DB. The reason is that if we re-use the md5 (or change its semantics to mean "near duplicates covered by this value") then we run the risk of losing a lot of legitimate unique URLs from the DB.
Shingling, if you know how to implement it efficiently, would certainly be nice, but we could start by just passing a "normalized text" to md5. By "normalized text" I mean all lowercase, stopwords removed, punctuation removed, and any consecutive whitespace replaced with exactly 1 space character. We could also use an n-gram profile (either word-level or character-level) with coarse quantization.
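
As a rough illustration of the "normalized text" idea above, here is a minimal sketch in plain Java. It is not code from Nutch or from DeleteDuplicates.java: the class name NormalizedSignature, the methods normalize and signature, and the tiny stopword list are all hypothetical, and replacing punctuation with a space (rather than deleting it outright) is just one possible reading of "punctuation removed".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: computes an MD5 over normalized page text.
class NormalizedSignature {

    // Assumption: a tiny illustrative stopword list; a real one would be much larger.
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Lowercase, replace punctuation with spaces, drop stopwords,
    // and join the remaining tokens with exactly one space.
    static String normalize(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        // Anything that is not a letter, digit, or whitespace is treated as punctuation.
        String noPunct = lower.replaceAll("[^\\p{L}\\p{N}\\s]", " ");
        StringBuilder sb = new StringBuilder();
        for (String token : noPunct.trim().split("\\s+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    // MD5 of the normalized text, as a 32-character lowercase hex string.
    static String signature(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(normalize(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, two pages that differ only in case, punctuation, stopwords, or spacing get the same signature, while any other difference in wording still produces a different one. That is why it would be stored in the separate "dedup" field rather than in place of the exact md5.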
