(Please don't cross-post to multiple lists)
Emmanuel wrote:
I've been through the code of the CrawlDbReader class. I discovered the
method "processTopNJob", which uses the classes CrawlDbTopNMapper and
CrawlDbTopNReducer.
I'm wondering why we have this method. Is it an old implementation that
was used before the Generator to get the top N links to fetch, or is it
something else?
I would appreciate your thoughts.
It's not an old method; it's still in use. See the synopsis in
CrawlDbReader.main(). The purpose of this option is to dump the
top-scoring URLs together with their scores, which is useful for
monitoring the CrawlDb for potential scoring problems.
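To make the mechanism concrete, the option is exposed on the command line
roughly as "bin/nutch readdb <crawldb> -topN <nnnn> <out_dir> [<min>]"
(see CrawlDbReader.main() for the exact synopsis). The sketch below shows
the general top-N technique, not the actual Nutch classes: the mapper keys
each URL by its negated score so the shuffle sort delivers the highest
scores first, and a single reducer stops after N records. Class names and
the hardcoded limit are illustrative only; the real job takes topN (and an
optional minimum score) from the JobConf.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative sketch only - run with a single reducer so the global
// top N can be taken from the head of the sorted key stream.
public class TopNSketch {

  public static class TopNMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, FloatWritable, Text> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      // Negate the score: ascending key order equals descending score order.
      output.collect(new FloatWritable(-datum.getScore()), url);
    }
  }

  public static class TopNReducer extends MapReduceBase
      implements Reducer<FloatWritable, Text, FloatWritable, Text> {
    private long topN = 100; // illustrative default; normally set via JobConf
    private long emitted = 0;

    public void reduce(FloatWritable negatedScore, Iterator<Text> urls,
        OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      while (urls.hasNext() && emitted < topN) {
        // Undo the negation before writing the (score, url) pair.
        output.collect(new FloatWritable(-negatedScore.get()), urls.next());
        emitted++;
      }
    }
  }
}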
I also found a class which is not used: CrawlDbDumpReducer is defined
but never used or instantiated.
Don't you think we can remove it from the source code?
Yes, we can remove this class - it's equivalent to IdentityReducer,
which this job uses implicitly. The class is a leftover from the time
when it also contained some filtering code.
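For reference, an identity reducer simply passes every (key, value) pair
through unchanged, and the old mapred API falls back to IdentityReducer
when no reducer class is set on the job. The sketch below is illustrative
rather than the Nutch class, but it shows why an explicit dump reducer
with no filtering logic adds nothing:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative pass-through reducer: every CrawlDatum is written out
// unchanged under its URL key, exactly what IdentityReducer already does.
public class DumpReducerSketch extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {
  public void reduce(Text url, Iterator<CrawlDatum> values,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(url, values.next());
    }
  }
}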
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com