> total 8
> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
> Found 2 items
> /nutch/search/crawl/segments/20080417204644     <dir>           2008-04-17 
> 20:46        rwxr-xr-x       hilkiah hilkiah
> /nutch/search/crawl/segments/20080417204628     <dir>           2008-04-17 
> 20:46        rwxr-xr-x       hilkiah hilkiah
> ** segment: crawl/segments/20080417204628
> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628 
> -threads 1000
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080417204628
> Fetcher: java.io.IOException: Segment already fetched!
> 
> I think the problem is that hadoop doesn't return the latest segment 
> directory in chronological order (or alphanumeric order).  First, is this a 
> known issue and if so, how do I work around it?  Secondly, since I believe I 
> refetch/reindex all pages (I only do depth=1 and run a series of crawls at 
> depth=1), can I safely delete the old segments/nnnnn folders before 
> generating new ones?

Absolutely correct.  Hadoop dfs ls does not always list files
in order.  Pipe the listing through sort before you use tail to
pick the latest segment.
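For example, a minimal sketch (the paths, the grep filter, and the
awk field are assumptions based on the listing above; adjust them
for your own install):

  # pick the newest segment from the dfs listing:
  # drop the "Found N items" header, sort by path (timestamped
  # directory names sort chronologically), take the last entry
  segment=`bin/hadoop dfs -ls crawl/segments/ | grep segments | \
           sort | tail -1 | awk '{print $1}'`
  bin/nutch fetch $segment -threads 1000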

You can only delete old segments after the refetch time has
passed for that segment, and all entries in that segment
have been refetched.

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
