Re: Sequence File Question

Andrzej Bialecki Wed, 28 Mar 2007 12:34:51 -0800

Steve Severance wrote:

Let me actually refine that question: why do some directories like the linkdb
have a current, and why do others like parse_data not? Is there a convention
on this?

First, to answer your original question: you should use the MapFileOutputFormat class for reading such output. It handles these part-xxxx subdirectories automatically.

Second, the "current" subdirectory is there in order to properly handle DB updates - or actually replacements - see e.g. the CrawlDb.install() method for details. This is not needed in case of segments, which are created once and never updated.
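
To illustrate the replacement idea, here is a minimal sketch on the local filesystem using java.nio.file. It is not the Nutch code: the class name DbInstallSketch, the "old" backup directory and the exact order of the steps are assumptions made for the example, so check CrawlDb.install() itself for the real behaviour.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogue of a "replace the DB" step.
final class DbInstallSketch {

    // Swap freshly written job output into <db>/current.
    static void install(Path db, Path freshOutput) throws IOException {
        Path current = db.resolve("current");
        Path old = db.resolve("old");       // assumed name for the backup copy
        Files.createDirectories(db);
        if (Files.exists(current)) {
            deleteRecursively(old);         // drop any stale backup first
            Files.move(current, old);       // keep the previous version for now
        }
        Files.move(freshOutput, current);   // the new data becomes "current"
        deleteRecursively(old);             // swap succeeded, discard the backup
    }

    // Delete a directory tree, children before parents; no-op if it is absent.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("install-sketch");
        Path db = tmp.resolve("linkdb");
        for (String version : new String[] {"v1", "v2"}) {
            // Each "job" writes to its own output directory, then installs it.
            Path out = Files.createDirectories(tmp.resolve("out-" + version));
            Files.write(out.resolve("part-00000"), version.getBytes(StandardCharsets.UTF_8));
            install(db, out);
        }
        byte[] data = Files.readAllBytes(db.resolve("current").resolve("part-00000"));
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

Readers always open <db>/current, so a job can write its complete output elsewhere and swap it in at the end. A segment is written once, so it has nothing to swap.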

Thirdly, although you didn't ask about it ;) the latest version of Hadoop contains a handy facility called Counters - if you use the PR PowerMethod you need to collect PR from dangling nodes in order to redistribute it later. You can use Counters for this, and save on a separate aggregation step.
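
The dangling-node bookkeeping can be sketched in plain Java, without Hadoop. Everything here is an assumption made for illustration: the class name DanglingMassSketch, the AtomicLong standing in for a job counter, and the SCALE constant. Counters hold long values, so a fractional rank would have to be recorded in fixed point, which is the reason for the scaling.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical single-process sketch of one PageRank power-method step.
final class DanglingMassSketch {

    // Assumed fixed-point scale: counters are long-valued, ranks are fractional.
    static final long SCALE = 1_000_000_000L;

    // outLinks[u] lists the nodes that u links to; rank must sum to 1.
    static double[] step(int[][] outLinks, double[] rank, double damping) {
        int n = rank.length;
        double[] next = new double[n];
        AtomicLong danglingCounter = new AtomicLong(); // stands in for a job counter

        for (int u = 0; u < n; u++) {
            if (outLinks[u].length == 0) {
                // Dangling node: record its rank in the counter instead of emitting it.
                danglingCounter.addAndGet(Math.round(rank[u] * SCALE));
                continue;
            }
            double share = rank[u] / outLinks[u].length;
            for (int v : outLinks[u]) {
                next[v] += share;
            }
        }

        // Read the counter once the pass is done and spread the mass evenly.
        double danglingShare = (danglingCounter.get() / (double) SCALE) / n;
        for (int v = 0; v < n; v++) {
            next[v] = (1.0 - damping) / n + damping * (next[v] + danglingShare);
        }
        return next;
    }

    public static void main(String[] args) {
        int[][] outLinks = { {1}, {2}, {} };  // node 2 has no out-links
        double[] rank = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        double[] next = step(outLinks, rank, 0.85);
        double sum = 0.0;
        for (double r : next) {
            sum += r;
        }
        System.out.println("total rank after one step: " + sum);
    }
}
```

In a MapReduce job the increment would happen inside the tasks and the driver would read the total after the job finishes. That is where the separate aggregation step is saved. The fixed-point rounding loses a little precision, so the total rank stays only approximately equal to 1.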



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
