Re: stemming

Andrzej Bialecki Wed, 28 Jun 2006 03:15:38 -0700

Eugen Kochuev wrote:

> P.P.S Why not to develop efficient technique to fight near-duplicates
> and SE spam? This is absolutely necessary if build Internet search

Why not, indeed? ;) The answer is that it is very difficult. There are simple methods that Nutch uses (MD5 and "text profile"), but generally speaking it is a difficult task. If you consider that pages may contain elements that are changing daily (such as date) or even with every request (ads, counters, banners, current time), or depending on the context (first request, subsequent requests), or may be composed from reusable parts (portlets), the problem doesn't seem so trivial anymore.

There is some (not much) literature on the subject, if you are interested I can send you some links - and of course we would gladly welcome any contributions in this area!
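
To illustrate why an exact hash is not enough, here is a rough Python sketch of the two kinds of signature. It is not Nutch's actual code: the function names, the tokenizer and the default parameters are all assumptions made for the example. The idea is that a plain MD5 changes whenever a date or counter changes, while a signature built only from the frequently repeated words tends to stay the same.

```python
import hashlib
import re
from collections import Counter


def exact_signature(text: str) -> str:
    """MD5 of the raw text: any changed byte gives a different signature."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text: str, min_token_len: int = 2,
                      quant_rate: float = 0.01) -> str:
    """Hypothetical 'text profile' style signature (illustrative only).

    Keeps only tokens whose quantized frequency is non-zero, so rare
    tokens such as dates and counters do not affect the result.
    """
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()

    # Quantization step; at least 2 so that single occurrences are dropped.
    quant = max(2, round(max(counts.values()) * quant_rate))

    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))

    # Stable order: most frequent first, ties broken alphabetically.
    profile.sort(key=lambda item: (-item[1], item[0]))
    blob = "\n".join(f"{token} {freq}" for token, freq in profile)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()


page_monday = ("Nutch is a search engine. Nutch crawls pages. "
               "Search pages with Nutch. Updated 2006-06-28, visitor 1041")
page_tuesday = ("Nutch is a search engine. Nutch crawls pages. "
                "Search pages with Nutch. Updated 2006-06-29, visitor 1042")

print(exact_signature(page_monday) == exact_signature(page_tuesday))
print(profile_signature(page_monday) == profile_signature(page_tuesday))
```

Even this toy version only copes with small, rare changes. Rotating ads or portlets that repeat their own words would shift the profile, which is part of why the general problem is hard.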

> engine based on nutch. Another "must have" is variable refetch time
> for pages (this could be based on estimating average update time of
> the page + taking into account page score)

This is more or less ready to be committed. As was discussed earlier on nutch-dev, since this is a significant change I'm holding the commit until after the release.
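
For readers who want a feel for the idea, here is a minimal Python sketch of one way a variable refetch time could work. It is only an assumption-laden illustration, not the change waiting to be committed: the function name, the rates and the bounds are made up, and weighting by page score is left out entirely.

```python
def next_fetch_interval(interval: float, changed: bool,
                        inc_rate: float = 0.4, dec_rate: float = 0.2,
                        min_interval: float = 60.0,
                        max_interval: float = 30 * 24 * 3600.0) -> float:
    """Return the next refetch interval in seconds (hypothetical scheme).

    If the page changed since the last fetch, come back sooner;
    if it did not, back off. The result is clamped to fixed bounds.
    """
    if changed:
        interval *= 1.0 - dec_rate
    else:
        interval *= 1.0 + inc_rate
    return min(max_interval, max(min_interval, interval))


# A page that keeps changing is fetched more and more often.
interval = 24 * 3600.0
for _ in range(3):
    interval = next_fetch_interval(interval, changed=True)
print(interval)
```

Over many fetches the interval drifts toward the page's typical update period, which is a crude stand-in for "estimating average update time".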


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
