Hi,

On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I have tried the recrawl script which is available on
> http://wiki.apache.org/nutch/IntranetRecrawl.
>
> I have got the following error:
>
> 2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
> 2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
> google/segments/20080915170335
> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
> java.io.IOException: Segment already fetched!
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> 2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
>
> Please help me to solve this error.
>

The segment you are trying to crawl has already been fetched. Try
removing everything but crawl_generate under that segment.
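
If I remember correctly, the check in FetcherOutputFormat refuses to
run as soon as the segment already contains the output of an earlier
fetch (a crawl_fetch directory), so deleting that output and keeping
only crawl_generate lets the fetch run again. A rough sketch using the
segment path from your log, not taken from the wiki script (delete
whichever of these directories actually exist):

  # keep only crawl_generate under the segment; drop the old fetch/parse output
  # (use "hadoop dfs -rmr <path>" instead of rm -rf if the crawl lives on DFS)
  rm -rf google/segments/20080915170335/crawl_fetch
  rm -rf google/segments/20080915170335/content
  rm -rf google/segments/20080915170335/crawl_parse
  rm -rf google/segments/20080915170335/parse_data
  rm -rf google/segments/20080915170335/parse_text

After that, re-running the fetch step on the same segment should get
past the "Segment already fetched!" error.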

> Thanks in advance
>
> Regards,
> Chetan Patel
>
>
> Hilkiah Lavinier wrote:
>>
>> Hi,
>>
>> I think I've come across an issue with the way hadoop lists files. Or
>> maybe it's just me... anyway, I'm using a modified version of the crawl
>> script found on the wiki site. I'm trying to ensure that the fetch
>> operation always uses the latest segment generated by the generate
>> operation, thus the code looks like:
>>
>> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>
>> echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays"
>> $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
>>
>> if [ $? -ne 0 ]
>> then
>>   echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>   break
>> fi
>>
>> # debug
>> ls -l --sort=t -r $crawl/segments
>> $HADOOP dfs -ls $crawl/segments
>>
>> segment=`$HADOOP dfs -ls $crawl/segments | tail -1 | sed -e 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
>> echo "** segment: $segment"
>>
>> echo "** $NUTCH fetch $segment -threads $threads"
>> $NUTCH fetch $segment -threads $threads
>>
>> However, every so often the crawl fails as per:
>>
>> --- Beginning crawl at depth 1 of 1 ---
>> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments -adddays 24
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl/segments/20080417204644
>> Generator: filtering: true
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> generate return value: 0
>> total 8
>> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
>> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
>> Found 2 items
>> /nutch/search/crawl/segments/20080417204644   <dir>   2008-04-17 20:46   rwxr-xr-x   hilkiah   hilkiah
>> /nutch/search/crawl/segments/20080417204628   <dir>   2008-04-17 20:46   rwxr-xr-x   hilkiah   hilkiah
>> ** segment: crawl/segments/20080417204628
>> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628 -threads 1000
>> Fetcher: starting
>> Fetcher: segment: crawl/segments/20080417204628
>> Fetcher: java.io.IOException: Segment already fetched!
>>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
>>
>> I think the problem is that hadoop doesn't return the latest segment
>> directory in chronological order (or alphanumeric order). First, is this
>> a known issue and, if so, how do I work around it? Secondly, since I
>> believe I refetch/reindex all pages (I only do depth=1 and run a series
>> of crawls at depth=1), can I safely delete the old segment/nnnnn folder
>> before generating a new one?
>>
>> Regards,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI
>> 6 Winston Lane,
>> Goodwill,
>> Roseau, Dominica
>>
>> Mbl: (767) 275 3382
>> Hm:  (767) 440 3924
>> Fax: (767) 440 4991
>> VoIP USA: (646) 432 4487
>>
>> Email: [EMAIL PROTECTED]
>> Email: [EMAIL PROTECTED]
>> IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
>> IM: ICQ #8978201 / AOL hilkiah21
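
Regarding the older question quoted above: as far as I know, hadoop
dfs -ls gives no guarantee about the order of its output, so piping it
through tail -1 can hand you an already-fetched segment, which is
exactly the "Segment already fetched!" failure shown. Since segment
names are just timestamps, sorting the names themselves is more
reliable than relying on listing order. A rough, untested variant of
the quoted snippet (it assumes the same $HADOOP and $crawl variables,
and that the item path is the first column of the dfs -ls output, as
in the listing above):

  # pick the newest segment by its timestamped name,
  # not by the position of the line in the dfs -ls output
  segment=$crawl/segments/`$HADOOP dfs -ls $crawl/segments | grep 'segments/' | awk '{print $1}' | sed -e 's#.*/##' | sort -n | tail -1`
  echo "** segment: $segment"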

--
Doğacan Güney
