Steve Severance wrote:
Let me actually refine that question: why do some directories, like the linkdb,
have a current, and why do others, like parse_data, not? Is there a convention
on this?

First, to answer your original question: you should use the MapFileOutputFormat class to read such output. It handles the part-xxxx subdirectories automatically.
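
To illustrate why a reader has to know about the part-xxxx layout, here is a toy model in plain Java, not Hadoop code. It assumes the output was written with a hash partitioner, uses String keys in place of Hadoop key types, and uses in-memory sorted maps in place of map files; the class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of a directory of part-xxxxx map files. Hypothetical names,
// no Hadoop classes involved.
class PartLookupSketch {

    // Same formula that Hadoop's HashPartitioner applies to a key's hash code.
    static int partitionFor(String key, int numParts) {
        return (key.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Write side: every key lands in exactly one sorted part.
    static List<TreeMap<String, String>> write(String[][] entries, int numParts) {
        List<TreeMap<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < numParts; i++) {
            parts.add(new TreeMap<>());
        }
        for (String[] entry : entries) {
            parts.get(partitionFor(entry[0], numParts)).put(entry[0], entry[1]);
        }
        return parts;
    }

    // Read side: choose the part with the same partitioner, then look up the key.
    static String get(List<TreeMap<String, String>> parts, String key) {
        return parts.get(partitionFor(key, parts.size())).get(key);
    }
}
```

The point of the sketch is only that a lookup must use the same partitioning as the job that wrote the data; otherwise it would search the wrong part.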

Second, the "current" subdirectory exists so that DB updates (or, more precisely, replacements) are handled properly; see e.g. the CrawlDb.install() method for details. Segments don't need this, because they are created once and never updated.
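
A rough sketch of the replacement idea, written against the local filesystem with java.nio.file rather than Hadoop's FileSystem. This is not the actual CrawlDb.install() code; the "old" directory name, the class name and the demo() helper are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of swapping a freshly built DB into "current".
class DbInstallSketch {

    // Recursively delete a directory tree; does nothing if it is absent.
    static void deleteTree(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) Files.delete(p);
    }

    // Replace db/current with freshlyBuilt (assumed to be on the same filesystem).
    static void install(Path db, Path freshlyBuilt) throws IOException {
        Path current = db.resolve("current");
        Path old = db.resolve("old");
        Files.createDirectories(db);
        if (Files.exists(current)) {
            deleteTree(old);               // clear leftovers from an earlier run
            Files.move(current, old);      // set the previous version aside
        }
        Files.move(freshlyBuilt, current); // the new version becomes "current"
        deleteTree(old);                   // drop the previous version
    }

    // Install two versions in a temp dir and report what "current" holds.
    static String demo() {
        try {
            Path root = Files.createTempDirectory("db-install-sketch");
            Path db = root.resolve("crawldb");
            for (String version : new String[] {"v1", "v2"}) {
                Path fresh = root.resolve("tmp-" + version);
                Files.createDirectories(fresh);
                Files.write(fresh.resolve("part-00000"),
                            version.getBytes(StandardCharsets.UTF_8));
                install(db, fresh);
            }
            byte[] data = Files.readAllBytes(db.resolve("current").resolve("part-00000"));
            boolean oldGone = !Files.exists(db.resolve("old"));
            deleteTree(root);
            return new String(data, StandardCharsets.UTF_8) + (oldGone ? "" : "+old");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Readers always open "current", so they see either the previous version or the new one, never a half-written DB. A segment is written once, so it has no need for this indirection.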

Third, although you didn't ask about it ;) the latest version of Hadoop contains a handy facility called Counters. If you use the PageRank power method, you need to collect the PageRank of dangling nodes in order to redistribute it later. You can use Counters for this and save a separate aggregation step.
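
Here is a minimal single-process sketch of that idea in plain Java, with a long variable standing in for the Hadoop counter. Hadoop counters hold long values, so the sketch assumes the dangling mass is stored in fixed point; the SCALE factor, the even redistribution over all nodes, and all names are assumptions of this example, not anything prescribed by Hadoop or Nutch.

```java
// Hypothetical sketch: one power-method step that collects dangling mass
// in a long "counter" instead of running a separate aggregation pass.
class DanglingCounterSketch {

    // Assumed fixed-point scale, since a counter holds a long, not a double.
    static final long SCALE = 1_000_000_000L;

    // links[i] lists the outlink targets of node i; rank[i] is its current score.
    static double[] step(int[][] links, double[] rank, double damping) {
        int n = rank.length;
        double[] next = new double[n];
        long danglingCounter = 0L; // stands in for a job-wide counter

        for (int i = 0; i < n; i++) {
            if (links[i].length == 0) {
                // Dangling node: nothing to pass along, so count its score.
                danglingCounter += Math.round(rank[i] * SCALE);
            } else {
                double share = rank[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;
                }
            }
        }

        // After the pass, read the counter and spread the mass evenly.
        double dangling = (double) danglingCounter / SCALE;
        for (int i = 0; i < n; i++) {
            next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
        }
        return next;
    }
}
```

In a real job the increment would happen inside the tasks and the total would be read once the job has finished; how exactly that is done depends on the Hadoop version in use.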


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
