Massimo Miccoli wrote:
Hi Doug,
Many thanks for your patch; I'll try it now. I'm also thinking of integrating
an algorithm for near-duplicate URL detection, something like shingling.
Is dedup the best place to integrate such an algorithm?
That would be lovely. Dedup is the place to start, but certainly not the
place to stop... ;-)
I think we should introduce a separate "dedup" field for each page in
the DB. The reason is that if we re-use the md5 (or change its semantics
to mean "near duplicates covered by this value") then we run the risk of
losing a lot of legitimately unique URLs from the DB.
Shingling, if you know how to implement it efficiently, would certainly
be nice - but we could start by just passing a "normalized text" to md5.
By "normalized text" I mean all lowercase, stopwords removed,
punctuation removed, and any run of consecutive whitespace replaced with
exactly one space character. We could also use an n-gram profile (either
word-level or character-level) with coarse quantization.
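To make the idea concrete, here is a minimal sketch (in Python, for brevity - a real Nutch patch would be Java) of both approaches: hashing a normalized text with md5, and comparing word-level shingle sets. The stopword list and the shingle size k=3 are illustrative choices, not anything from Nutch itself.

```python
import hashlib
import re

# Tiny illustrative stopword list -- a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def normalize_text(text):
    # Lowercase, strip punctuation, drop stopwords, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

def dedup_signature(text):
    # md5 over the normalized text: near-identical pages get the same
    # signature, so they collide in the dedup step.
    return hashlib.md5(normalize_text(text).encode("utf-8")).hexdigest()

def shingles(text, k=3):
    # Word-level k-shingles over the normalized text.
    words = normalize_text(text).split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    # Jaccard similarity of the two shingle sets; 1.0 means the
    # normalized texts shingle identically.
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

With this, two pages differing only in case, punctuation, and spacing produce the same signature, while shingle resemblance gives a graded score for pages that differ more substantially.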
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com