The answer is simple and not so simple at the same time. Last year we put in quite a bit of work to implement a stable PageRank-like algorithm in Nutch. This was released as the new scoring and indexing frameworks. That gives a good general relevancy score, but it is really a starting point.
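
For anyone unfamiliar with what "PageRank-like" means, here is a minimal sketch of the textbook power-iteration form of PageRank in Python. It is not the Nutch implementation (which runs as MapReduce jobs in Java). The function name `pagerank`, the damping factor of 0.85, the iteration count, and the tiny example graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative, not Nutch's code).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; the scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution.
    scores = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Score held by pages without outlinks is spread evenly over all pages.
        dangling = sum(scores[p] for p in pages if not links.get(p))
        new_scores = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        # Each page passes an equal share of its damped score to every outlink.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical four-page graph: "c" collects the most inlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

A score like this only captures link structure, which is why it is a starting point rather than a finished relevancy model.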

Many people look at search engines and see a single algorithm, such as PageRank. In reality, a modern search engine, such as Google or Yahoo, will have hundreds of algorithms and jobs that contribute to the relevancy of search results. This is because of two factors:

1) After getting good general relevancy (i.e. link analysis and such), search relevancy is about handling specific relevancy issues. For example, handling reciprocal links, near-duplicate detection, organizations that own 100k domains, template pages, blogs and echo chambers, hacked pages and blogs with link and keyword spam, malware, etc. Each of these types of issues, and there are many more, requires specific algorithms to handle it.

Google and Yahoo would have algorithms (and people who specialize in certain areas) to handle all of these types of issues, usually through statistical analysis and machine learning jobs. The outputs of these jobs would then be aggregated (think pipeline) to form the final search engine relevancy scores.

In all fairness, this is offline relevancy. There would also be a considerable amount of work done on query parsing and online relevancy.

2) Relevancy scores change over time due to people and companies attempting to manipulate search results through SEO (both good and bad), through culture in general, and through search engines working through better algorithms.

So this is a long way of explaining that while Nutch currently has, IMO, good general relevancy, taking it to the next level, where results are "as good as Google", is going to take many different specialized MapReduce jobs that we currently don't have.
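
As a rough illustration of the pipeline idea in point 1, here is a minimal Python sketch in which a general link score is adjusted by the outputs of specialized jobs. Everything in it is hypothetical: the function `combine_scores`, the choice to combine by multiplying per-page factors, the job names, and all the numbers. It does not describe how Nutch, Google, or Yahoo actually combine signals.

```python
def combine_scores(base_scores, job_outputs):
    """Apply per-page adjustment factors from several jobs to a base score.

    base_scores: dict mapping page -> general relevancy score (e.g. link analysis).
    job_outputs: list of dicts, one per specialized job, mapping page -> factor.
                 A factor below 1.0 demotes a page; pages a job does not
                 mention are left unchanged by that job.
    """
    final = dict(base_scores)
    for output in job_outputs:
        for page, factor in output.items():
            if page in final:
                final[page] *= factor
    return final


# Hypothetical base scores from the general link-analysis step.
base = {"a.example": 0.40, "b.example": 0.35, "spam.example": 0.25}

# Hypothetical outputs of two specialized jobs.
link_spam_job = {"spam.example": 0.1}       # strong demotion for detected link spam
near_duplicate_job = {"b.example": 0.5}     # milder demotion for a near duplicate

final = combine_scores(base, [link_spam_job, near_duplicate_job])
ranking = sorted(final, key=final.get, reverse=True)
print(ranking)
```

Each new relevancy problem then becomes one more job feeding factors into the same aggregation step, rather than a change to the core link-analysis score.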

Dennis

atencorps wrote:
Nutch is a great search engine, and I was recently pleased when the large
multinational I work for ran some trials of Nutch vs. Google while we were
evaluating and looking for enterprise search. I was glad to see Nutch was a
worthy competitor; Google Enterprise was chosen only due to office
politics (preferring a large company over a smaller one, etc.).

In terms of enterprise search I think Nutch already has it covered; my
question is about Internet search.

PageRank has been around for over 10 years and is what built Google. Are
there any newer, more capable ranking algorithms available, and is
there any vision for implementing a truly worthy ranking algorithm
in Nutch that could deliver quality Internet search results like
Google?



