Got it. I am going to document this on the wiki. Thanks.

Steve

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 29, 2007 2:31 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Sequence File Question

Steve Severance wrote:
>> DB updates - or actually replacements - see e.g. CrawlDb.install()
>> method for details. This is not needed in case of segments, which
>> are created once and never updated.
>
> How does the reader know which one it is expecting? For instance, I
> can make a reader to read a linkDB just by instantiating it on the
> directory crawl/linkdb, and it knows to go inside the current
> directory. But when opening a parse_data there is no current. So how
> does it know which to expect?

Use The Source Luke ;)

It follows this (arbitrary) naming convention: we always use a
"current" subdirectory when working with LinkDb and CrawlDb. And it
follows a different naming convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch
naming convention and always appends "current" to the db name. But if
you were to use MapFileOutputFormat.getReaders() directly, this Hadoop
class of course doesn't know about this, so you need to provide a full
path that includes "current".

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
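
For the wiki, the naming convention described above could be illustrated with a small sketch. It only builds path strings and does not call Nutch or Hadoop. The class name NutchPathSketch, both helper methods, the local constant, and the segment directory name SEGMENT are hypothetical and made up for illustration. The only details taken from the thread are the "current" subdirectory used for CrawlDb and LinkDb, and that a segment part such as parse_data has no "current" level.

```java
// Illustrative only: plain string handling, no Nutch or Hadoop dependency.
public class NutchPathSketch {

    // Assumption: a local constant mirroring the "current" subdirectory
    // name mentioned in the thread.
    static final String CURRENT_NAME = "current";

    // CrawlDb / LinkDb: the Nutch readers append "current" to the db name.
    // A caller going to the Hadoop classes directly has to do this itself.
    public static String dbDataDir(String dbRoot) {
        return dbRoot + "/" + CURRENT_NAME;
    }

    // Segments: written once and never replaced, so there is no "current"
    // level; the part directory sits directly under the segment directory.
    public static String segmentPartDir(String segmentDir, String part) {
        return segmentDir + "/" + part;
    }

    public static void main(String[] args) {
        System.out.println(dbDataDir("crawl/linkdb"));
        // "SEGMENT" is a placeholder, not a real segment name.
        System.out.println(segmentPartDir("crawl/segments/SEGMENT", "parse_data"));
    }
}
```

The string returned by dbDataDir is what would be handed to MapFileOutputFormat.getReaders() when bypassing LinkDbReader or CrawlDbReader; the exact arguments of that call should be checked against the Hadoop version in use.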