RE: Sequence File Question

Steve Severance Thu, 29 Mar 2007 06:30:20 -0800

> -----Original Message-----
> From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 28, 2007 4:34 PM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Sequence File Question
> 
> Steve Severance wrote:
> > Let me actually refine that question: why do some directories like the
> > linkdb have a current, and why do others like parse_data not? Is there
> > a convention on this?
> 
> First, to answer your original question: you should use
> MapFileOutputFormat class for reading such output. It handles these
> part-xxxx subdirectories automatically.
> 
> Second, the "current" subdirectory is there in order to properly handle
> DB updates - or actually replacements - see e.g. CrawlDb.install()
> method for details. This is not needed in case of segments, which are
> created once and never updated.
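
For readers following along, the replace-on-update idea described above can be sketched in plain Java. This is only an illustration of the pattern on a local filesystem, not the actual CrawlDb.install() code: the class name, the "old" backup directory name, and the helper methods are assumptions made for the example, and it uses java.nio.file rather than the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DbInstallSketch {
    static final String CURRENT_NAME = "current"; // subdirectory name mentioned in the thread
    static final String OLD_NAME = "old";         // hypothetical name for the previous generation

    /** Swap freshly written data in newDb into db/current, discarding the previous generation. */
    static void install(Path newDb, Path db) throws IOException {
        Path current = db.resolve(CURRENT_NAME);
        Path old = db.resolve(OLD_NAME);
        Files.createDirectories(db);
        if (Files.exists(current)) {
            deleteRecursively(old);       // clear any leftover backup first
            Files.move(current, old);     // keep the previous data until the swap succeeds
        }
        Files.move(newDb, current);       // a rename; assumes both paths are on one filesystem
        deleteRecursively(old);
    }

    /** The directory a DB reader would open in this sketch: always db/current. */
    static Path currentDir(Path db) {
        return db.resolve(CURRENT_NAME);
    }

    /** Delete a directory tree, children before parents; does nothing if root is missing. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Because readers in this sketch only ever look at db/current, an update becomes visible in a single rename step. A segment directory that is written once has no need for this indirection, which matches the explanation quoted above.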


How does the reader know which one to expect? For instance, I can make a 
reader to read a linkdb just by instantiating it on the directory crawl/linkdb, 
and it knows to go inside the current directory. But when opening a parse_data 
there is no current. So how does it know which to expect?

Steve

> 
> Thirdly, although you didn't ask about it ;) the latest version of
> Hadoop contains a handy facility called Counters - if you use the PR
> PowerMethod you need to collect PR from dangling nodes in order to
> redistribute it later. You can use Counters for this, and save on a
> separate aggregation step.
> 
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
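
The Counters suggestion quoted above can be illustrated with a single-process sketch of one power-method iteration. This is plain Java, not Hadoop code: the class and method names are hypothetical, and the local variable danglingMass merely stands in for the job-wide total that a counter would collect. Hadoop counters hold long values, so a real job would presumably need to scale the fractional rank to an integer before incrementing one; that detail is an assumption and is not shown here.

```java
public class DanglingMassSketch {
    /**
     * One power-method iteration. outLinks[i] lists the targets of node i;
     * nodes with no outlinks are "dangling" and their rank is spread evenly.
     */
    static double[] iterate(int[][] outLinks, double[] rank, double damping) {
        int n = rank.length;
        double[] next = new double[n];
        double danglingMass = 0.0; // plays the role of the counter in this sketch

        for (int i = 0; i < n; i++) {
            if (outLinks[i].length == 0) {
                danglingMass += rank[i]; // collected during the same pass, no extra step
                continue;
            }
            double share = rank[i] / outLinks[i].length;
            for (int target : outLinks[i]) {
                next[target] += share;
            }
        }

        // Teleport term plus the evenly redistributed dangling mass.
        double base = (1.0 - damping) / n + damping * danglingMass / n;
        for (int i = 0; i < n; i++) {
            next[i] = base + damping * next[i];
        }
        return next;
    }
}
```

With the dangling mass redistributed this way, the ranks keep summing to 1 after each iteration, which is a convenient sanity check.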
