> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 4:34 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Sequence File Question
>
> Steve Severance wrote:
> > Let me actually refine that question: why do some directories like the
> > linkdb have a current, and why do others like parse_data not? Is there a
> > convention on this?
>
> First, to answer your original question: you should use
> MapFileOutputFormat class for reading such output. It handles these
> part-xxxx subdirectories automatically.
>
> Second, the "current" subdirectory is there in order to properly handle
> DB updates - or actually replacements - see e.g. CrawlDb.install()
> method for details. This is not needed in case of segments, which are
> created once and never updated.
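
To make the "replacement" idea concrete, here is a minimal sketch of swapping a freshly written DB into a "current" subdirectory by renaming. It uses plain java.nio on a local filesystem and is not the Nutch CrawlDb.install() code; the class name CurrentDirSwap, the method names, and the "old" directory name are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: illustrates replace-by-rename, not the actual Nutch code.
class CurrentDirSwap {

    /** Makes newDb the new db/current, setting the previous version aside as db/old during the swap. */
    static void install(Path newDb, Path db) throws IOException {
        Path current = db.resolve("current");
        Path old = db.resolve("old");          // assumed name for the set-aside copy
        Files.createDirectories(db);
        deleteRecursively(old);                // drop leftovers from an earlier, interrupted run
        if (Files.exists(current)) {
            Files.move(current, old);          // keep the previous version until the swap is done
        }
        Files.move(newDb, current);            // the new data becomes "current"
        deleteRecursively(old);                // the previous version is no longer needed
    }

    /** Deletes a directory tree, children before parents; does nothing if the path is absent. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Because readers only ever open db/current, they see either the previous version or the new one, never a half-written directory. A segment is written once, so it has no need for this extra level.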
How does the reader know which layout to expect? For instance, I can create a
reader for a linkdb just by instantiating it on the directory crawl/linkdb, and
it knows to go inside the current directory. But when opening a parse_data
there is no current. So how does it know which one to expect?

Steve

>
> Thirdly, although you didn't ask about it ;) the latest version of
> Hadoop contains a handy facility called Counters - if you use the PR
> PowerMethod you need to collect PR from dangling nodes in order to
> redistribute it later. You can use Counters for this, and save on a
> separate aggregation step.
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
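
For reference, the dangling-node bookkeeping mentioned in the quoted text can be sketched in plain Java, with no Hadoop involved. A single accumulator stands in for the job-wide counter: rank held by nodes without outlinks is summed during the pass and then spread evenly over all nodes. The class name DanglingMassSketch, the adjacency-array layout, and the damping parameter are assumptions for illustration. Hadoop counters hold long values, so a real job would have to scale the floating-point mass to an integer, which is not shown here.

```java
// Hypothetical sketch: one power-method step with dangling-node mass collected
// in a single accumulator (the role a job-wide counter would play).
class DanglingMassSketch {

    /**
     * Runs one PageRank power-method step.
     * outLinks[i] lists the target nodes of node i; an empty array marks a dangling node.
     */
    static double[] step(int[][] outLinks, double[] rank, double damping) {
        int n = rank.length;
        double[] next = new double[n];
        double danglingMass = 0.0;             // stands in for the counter

        for (int i = 0; i < n; i++) {
            if (outLinks[i].length == 0) {
                danglingMass += rank[i];       // collect rank that has nowhere to go
            } else {
                double share = rank[i] / outLinks[i].length;
                for (int target : outLinks[i]) {
                    next[target] += share;     // pass rank along each outlink
                }
            }
        }

        // Redistribute the collected dangling mass evenly, then apply damping.
        for (int i = 0; i < n; i++) {
            next[i] = (1.0 - damping) / n + damping * (next[i] + danglingMass / n);
        }
        return next;
    }
}
```

Since the dangling mass is gathered in the same pass that distributes rank along links, no separate aggregation pass over the graph is needed, and the ranks keep summing to 1 after each step.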