Eugen Kochuev wrote:
> P.P.S Why not to develop efficient technique to fight near-duplicates
> and SE spam? This is absolutely necessary if build Internet search

Why not, indeed? ;) The answer is that it is very difficult. Nutch uses some simple methods (MD5 and "text profile" signatures), but in general this is a hard problem. Pages may contain elements that change daily (such as the date) or with every request (ads, counters, banners, the current time), or that depend on the context (first request versus subsequent requests), and pages may be composed of reusable parts (portlets). Once you take all of that into account, the problem doesn't seem so trivial anymore.
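
To illustrate the difference between the two kinds of signature, here is a minimal sketch in Python. It is not the Nutch implementation; the function names, the tokenizer, and the `min_len` and `quant` parameters are assumptions made up for this example. The exact hash changes when a single word changes, while the profile hash ignores short and rare tokens, so a page that differs only in a date-like element can still map to the same signature.

```python
import hashlib
import re
from collections import Counter


def exact_signature(text: str) -> str:
    """MD5 over the raw text: any change at all yields a different value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text: str, min_len: int = 3, quant: int = 2) -> str:
    """Hypothetical 'text profile': hash of quantized token frequencies.

    Tokens shorter than min_len are ignored, counts are rounded down to a
    multiple of quant, and tokens whose rounded count is 0 are dropped.
    """
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if len(t) >= min_len]
    profile = []
    for token, count in Counter(tokens).items():
        rounded = (count // quant) * quant
        if rounded > 0:
            profile.append((token, rounded))
    # Sort so the result does not depend on token order in the page.
    profile.sort(key=lambda item: (-item[1], item[0]))
    joined = "\n".join(f"{token} {count}" for token, count in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


body = "Nutch crawler fetches pages. Nutch search indexes pages. Crawler and search."
page_a = body + " Today is Monday."
page_b = body + " Today is Tuesday."

print(exact_signature(page_a) == exact_signature(page_b))
print(profile_signature(page_a) == profile_signature(page_b))
```

A profile like this is easy to fool in both directions (unrelated pages with similar vocabulary, or duplicates whose boilerplate dominates the frequent tokens), which is part of why the general problem remains hard.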

There is some (not much) literature on the subject; if you are interested, I can send you some links. And of course we would gladly welcome any contributions in this area!

> engine based on nutch. Another "must have" is variable refetch time
> for pages (this could be based on estimating average update time of
> the page + taking into account page score)

This is more or less ready to be committed. As discussed earlier on nutch-dev, this is a significant change, so I'm holding the commit until after the release.
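
For readers unfamiliar with the idea, a variable refetch time can be sketched roughly as follows. This is only an illustration in Python under assumed names and constants (`adapt_interval`, `effective_interval`, the rates and the bounds are all hypothetical), not the code waiting to be committed. The interval shrinks when a page was found changed, grows when it was not, and a higher page score shortens the effective interval.

```python
DAY = 86400.0  # seconds


def adapt_interval(interval: float, changed: bool,
                   inc_rate: float = 0.4, dec_rate: float = 0.2,
                   min_interval: float = 3600.0,
                   max_interval: float = 30 * DAY) -> float:
    """Return the next base refetch interval in seconds.

    A page that changed since the last fetch is revisited sooner;
    an unchanged page is revisited later. The result is clamped to
    [min_interval, max_interval]. All rates and bounds are assumptions.
    """
    if changed:
        interval *= 1.0 - dec_rate
    else:
        interval *= 1.0 + inc_rate
    return min(max(interval, min_interval), max_interval)


def effective_interval(base: float, score: float,
                       min_interval: float = 3600.0) -> float:
    """Shorten the base interval for pages with a score above 1.0."""
    return max(base / max(score, 1.0), min_interval)


interval = DAY
for changed in (True, True, False):
    interval = adapt_interval(interval, changed)
print(effective_interval(interval, score=2.0))
```

Keeping the score out of the stored base interval avoids compounding its effect on every fetch; it is applied only when the next fetch time is computed.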

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

