Steve Severance wrote:
Let me actually refine that question we do some directories like the linkdb have a current, and why do others like parse_data not? Is there a convention on this?
First, to answer your original question: you should use MapFileOutputFormat class for reading such output. It handles these part-xxxx subdirectories automatically.
Second, the "current" subdirectory is there in order to properly handle DB updates - or actually replacements - see e.g. CrawlDb.install() method for details. This is not needed in case of segments, which are created once and never updated.
Thirdly, although you didn't ask about it ;) the latest version of Hadoop contains a handy facility called Counters - if you use the PR PowerMethod you need to collect PR from dangling nodes in order to redistribute it later. You can use Counters for this, and save on a separate aggregation step.
-- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com