RE: Sequence File Question

Steve Severance Thu, 29 Mar 2007 10:47:55 -0800

Got it. I am going to document this on the wiki. Thanks.

Steve
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 29, 2007 2:31 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Sequence File Question

Steve Severance wrote:
>> DB updates - or actually replacements - see e.g. CrawlDb.install()
>>  method for details. This is not needed in case of segments, which 
>> are created once and never updated.
> 
> How does the reader know which layout to expect? For instance, I
> can read a linkDB just by instantiating a reader on the directory
> crawl/linkdb, and it knows to go inside the "current" directory.
> But when opening a parse_data there is no "current". So how does
> it know which one to expect?

Use the Source, Luke ;) The readers follow an (arbitrary) naming
convention: we always use a "current" subdirectory when working with
LinkDb and CrawlDb. SegmentReader follows a different naming convention.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if
you were to use MapFileOutputFormat.getReaders() directly, this Hadoop
class of course doesn't know about the convention, so you need to
provide a full path that includes "current".
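
To make the path difference concrete, here is a minimal sketch of the two
layouts. NutchDataPaths and its methods are hypothetical helpers written
for this illustration, not part of Nutch or Hadoop, and the segment name
is made up. The sketch only builds the directory strings; it does not
call Hadoop.

```java
// Hypothetical helper, NOT a Nutch or Hadoop class. It only illustrates
// the directory naming convention described above.
public final class NutchDataPaths {
    // Subdirectory that holds the live CrawlDb / LinkDb data.
    static final String CURRENT = "current";

    private NutchDataPaths() {}

    // For a db such as "crawl/linkdb", the data lives in "<db>/current".
    // This is the path you would have to build yourself before handing
    // it to MapFileOutputFormat.getReaders().
    static String dbDataDir(String dbDir) {
        return join(dbDir, CURRENT);
    }

    // Segment parts such as "parse_data" sit directly under the segment
    // directory; there is no "current" level in between.
    static String segmentPartDir(String segmentDir, String part) {
        return join(segmentDir, part);
    }

    // Joins two path pieces with a single "/" separator.
    private static String join(String parent, String child) {
        return parent.endsWith("/") ? parent + child : parent + "/" + child;
    }

    public static void main(String[] args) {
        // The segment name below is an assumption for the example.
        System.out.println(dbDataDir("crawl/linkdb"));
        System.out.println(segmentPartDir("crawl/segments/20070329", "parse_data"));
    }
}
```

CrawlDbReader and LinkDbReader do the equivalent of dbDataDir() for you,
which is why passing just crawl/linkdb works with them.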

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
