(Please don't cross-post to multiple lists)
Emmanuel wrote:
I've been through the code of the CrawlDbReader class. I discovered the
method "processTopNJob", which uses the classes CrawlDbTopNMapper and
CrawlDbTopNReducer.
I'm wondering why we have this method. Is it an old implementation that
was used before the Generator to get the top N links to fetch, or is it
something else?
I would appreciate your thoughts.
It's not an old method; it's still in use. See the synopsis in
CrawlDbReader.main(). The purpose of this option is to dump the
top-scoring URLs together with their scores, which is useful for
monitoring the CrawlDb for potential scoring problems.
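To make the mechanism concrete, the option is exposed on the command line
roughly as "bin/nutch readdb <crawldb> -topN <nnnn> <out_dir> [<min>]"
(see CrawlDbReader.main() for the exact synopsis). The sketch below shows
the general top-N technique, not the actual Nutch classes: the mapper keys
each URL by its negated score so the shuffle sort delivers the highest
scores first, and a single reducer stops after N records. Class names and
the hardcoded limit are illustrative only; the real job takes topN (and an
optional minimum score) from the JobConf.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative sketch only - run with a single reducer so the global
// top N can be taken from the head of the sorted key stream.
public class TopNSketch {

  public static class TopNMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, FloatWritable, Text> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      // Negate the score: ascending key order equals descending score order.
      output.collect(new FloatWritable(-datum.getScore()), url);
    }
  }

  public static class TopNReducer extends MapReduceBase
      implements Reducer<FloatWritable, Text, FloatWritable, Text> {
    private long topN = 100; // illustrative default; normally set via JobConf
    private long emitted = 0;

    public void reduce(FloatWritable negatedScore, Iterator<Text> urls,
        OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      while (urls.hasNext() && emitted < topN) {
        // Undo the negation before writing the (score, url) pair.
        output.collect(new FloatWritable(-negatedScore.get()), urls.next());
        emitted++;
      }
    }
  }
}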
I also found a class which is not used: CrawlDbDumpReducer is defined
but never used or instantiated.
Don't you think we can remove it from the source code?
Yes, we can remove this class - it's equivalent to IdentityReducer,
which this job uses implicitly. The class is a leftover from the time
when it also contained some filtering code.
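For reference, an identity reducer simply passes every (key, value) pair
through unchanged, and the old mapred API falls back to IdentityReducer
when no reducer class is set on the job. The sketch below is illustrative
rather than the Nutch class, but it shows why an explicit dump reducer
with no filtering logic adds nothing:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

// Illustrative pass-through reducer: every CrawlDatum is written out
// unchanged under its URL key, exactly what IdentityReducer already does.
public class DumpReducerSketch extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {
  public void reduce(Text url, Iterator<CrawlDatum> values,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(url, values.next());
    }
  }
}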
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com