Hi,

I have tried the recrawl script available at
http://wiki.apache.org/nutch/IntranetRecrawl.

I got the following error:

2008-09-15 17:04:32,238 INFO  fetcher.Fetcher - Fetcher: starting
2008-09-15 17:04:32,254 INFO  fetcher.Fetcher - Fetcher: segment:
google/segments/20080915170335
2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
java.io.IOException: Segment already fetched!
        at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)

2008-09-15 17:04:35,144 INFO  crawl.CrawlDb - CrawlDb update: starting
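
From the stack trace it looks like FetcherOutputFormat refuses to write into a
segment that already contains fetch output, so I am thinking of guarding the
fetch step roughly as below. This is only a rough sketch: it assumes the
$NUTCH, $HADOOP, $crawl and $threads variables from the wiki script, and the
crawl_fetch test is only my guess at what the "Segment already fetched!" check
looks for.

  # Sketch: fetch each segment only if it has no crawl_fetch output yet.
  for segment in `$HADOOP dfs -ls $crawl/segments | grep -o "$crawl/segments/[0-9]*"`
  do
    if $HADOOP dfs -test -e $segment/crawl_fetch
    then
      echo "Skipping $segment: already fetched"
    else
      $NUTCH fetch $segment -threads $threads
    fi
  done

If dfs -test is not available in your Hadoop version, grepping the -ls output
of the segment for crawl_fetch should do the same job.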

Please help me solve this error.

Thanks in advance

Regards,
Chetan Patel



Hilkiah Lavinier wrote:
> 
> Hi,
> 
> I think I've come across an issue with the way hadoop lists files, or
> maybe it's just me. Anyway, I'm using a modified version of the crawl
> script found on the wiki site. I'm trying to ensure that the fetch
> operation always uses the latest segment generated by the generate
> operation, so the code looks like this:
> 
> 
>  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> 
>   echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
> $adddays"
>   $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
> 
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
> 
>   #debug
>   ls -l --sort=t -r $crawl/segments
>   $HADOOP dfs -ls $crawl/segments
> 
>   segment=`$HADOOP dfs -ls $crawl/segments | tail -1|sed -e
> 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
>   echo "** segment: $segment"
> 
>   echo "** $NUTCH fetch $segment -threads $threads"
>   $NUTCH fetch $segment -threads $threads
> 
> 
> 
> However, every so often the crawl fails like this:
> 
> --- Beginning crawl at depth 1 of 1 ---
> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments 
> -adddays 24
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080417204644
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> generate return value: 0
> total 8
> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
> Found 2 items
> /nutch/search/crawl/segments/20080417204644     <dir>           2008-04-17
> 20:46        rwxr-xr-x       hilkiah hilkiah
> /nutch/search/crawl/segments/20080417204628     <dir>           2008-04-17
> 20:46        rwxr-xr-x       hilkiah hilkiah
> ** segment: crawl/segments/20080417204628
> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628
> -threads 1000
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080417204628
> Fetcher: java.io.IOException: Segment already fetched!
>         at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
>         at
> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
> 
> 
> 
> I think the problem is that hadoop doesn't return the segment directories
> in chronological (or alphanumeric) order, so the listing doesn't always end
> with the latest one. First, is this a known issue, and if so, how do I work
> around it? Secondly, since I believe I refetch/reindex all pages (I only do
> depth=1 and run a series of crawls at depth=1), can I safely delete the old
> segments/nnnnn folders before generating a new one?
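> 
> The workaround I'm leaning towards (only a rough sketch, assuming the segment
> directory names are the 14-digit timestamps shown above and that $HADOOP,
> $NUTCH, $crawl and $threads are set as in the script) is to sort the names
> myself instead of relying on the order dfs -ls happens to return:
> 
>   # Sketch: pick the newest segment by sorting the timestamp names.
>   latest=`$HADOOP dfs -ls $crawl/segments | grep -o '[0-9]\{14\}' | sort -n | tail -1`
>   segment=$crawl/segments/$latest
>   echo "** segment: $segment"
>   $NUTCH fetch $segment -threads $threads
> 
> Sorting numerically works here because the segment names are just the
> generation timestamps.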
> 
> Regards,
>  
> Hilkiah G. Lavinier MEng (Hons), ACGI 
> 6 Winston Lane, 
> Goodwill, 
> Roseau, Dominica
> 
> Mbl: (767) 275 3382
> Hm : (767) 440 3924
> Fax: (767) 440 4991
> VoIP USA: (646) 432 4487
> 
> 
> Email: [EMAIL PROTECTED]
> Email: [EMAIL PROTECTED]
> IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
> IM: ICQ #8978201  / AOL hilkiah21
