Hi,
I tried the recrawl script available at
http://wiki.apache.org/nutch/IntranetRecrawl
and got the following error:
2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
google/segments/20080915170335
2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
java.io.IOException: Segment already fetched!
at
org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
Please help me solve this error.
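
From the stack trace, the fetch is being pointed at a segment that
already contains fetch output, which FetcherOutputFormat refuses to
overwrite. A guard along these lines might avoid the error. This is only
a sketch: it assumes the script holds the segment path in a $segment
variable, follows the wiki script's $NUTCH/$HADOOP conventions, and
relies on a fetched segment being marked by its crawl_fetch subdirectory:

# Skip a segment that already has fetch output; the crawl_fetch
# subdirectory is what triggers "Segment already fetched!".
if $HADOOP dfs -ls $segment | grep -q crawl_fetch
then
  echo "segment $segment already fetched, skipping"
else
  $NUTCH fetch $segment -threads $threads
fi
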
Thanks in advance
Regards,
Chetan Patel
Hilkiah Lavinier wrote:
>
> Hi,
>
> I think I've come across an issue with the way Hadoop lists files, or
> maybe it's just me. Anyway, I'm using a modified version of the crawl
> script found on the wiki site. I'm trying to ensure that the fetch
> operation always uses the latest segment generated by the generate
> operation, so the code looks like:
>
>
> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>
> echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
> $adddays"
> $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
>
> if [ $? -ne 0 ]
> then
>   echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>   break
> fi
>
> #debug
> ls -l --sort=t -r $crawl/segments
> $HADOOP dfs -ls $crawl/segments
>
> segment=`$HADOOP dfs -ls $crawl/segments | tail -1 | sed -e 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
> echo "** segment: $segment"
>
> echo "** $NUTCH fetch $segment -threads $threads"
> $NUTCH fetch $segment -threads $threads
>
>
>
> However, every so often the crawl fails, as per:
>
> --- Beginning crawl at depth 1 of 1 ---
> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments
> -adddays 24
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20080417204644
> Generator: filtering: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> generate return value: 0
> total 8
> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
> Found 2 items
> /nutch/search/crawl/segments/20080417204644 <dir> 2008-04-17
> 20:46 rwxr-xr-x hilkiah hilkiah
> /nutch/search/crawl/segments/20080417204628 <dir> 2008-04-17
> 20:46 rwxr-xr-x hilkiah hilkiah
> ** segment: crawl/segments/20080417204628
> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628
> -threads 1000
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080417204628
> Fetcher: java.io.IOException: Segment already fetched!
> at
> org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
> at
> org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
> at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
>
>
>
> I think the problem is that Hadoop doesn't return the latest segment
> directory in chronological order (or alphanumeric order). First, is this a
> known issue, and if so, how do I work around it? Secondly, since I believe
> I refetch/reindex all pages (I only do depth=1 and run a series of crawls
> at depth=1), can I safely delete the old segments/nnnnn folder before
> generating a new one?
>
> Regards,
>
> Hilkiah G. Lavinier MEng (Hons), ACGI
> 6 Winston Lane,
> Goodwill,
> Roseau, Dominica
>
> Mbl: (767) 275 3382
> Hm : (767) 440 3924
> Fax: (767) 440 4991
> VoIP USA: (646) 432 4487
>
>
> Email: [EMAIL PROTECTED]
> Email: [EMAIL PROTECTED]
> IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
> IM: ICQ #8978201 / AOL hilkiah21
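
For the segment-selection problem in the quoted script: Nutch segment
names are timestamps (yyyyMMddHHmmss), so a plain lexicographic sort is
also a chronological one. Sorting the listing before taking the last
entry avoids depending on whatever order hadoop dfs -ls happens to
return. A sketch (the sed pattern that extracts the 14-digit directory
name is an assumption about the listing format):

# Pick the newest segment by sorting the timestamp names themselves
# instead of trusting the order of the dfs listing.
segment=$crawl/segments/`$HADOOP dfs -ls $crawl/segments \
  | sed -n 's|.*/\([0-9]\{14\}\).*|\1|p' | sort | tail -1`
echo "** segment: $segment"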
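
As for deleting old segments: if every crawl really does refetch all
pages at depth 1 and nothing (an index, readseg, dedup) still references
the older segments, then keeping only the newest one should be safe. A
sketch along the same lines, with sed '$d' dropping the last (newest)
name from the sorted listing:

# Remove every segment except the newest; assumes the older segments
# are no longer referenced by a live index or pending job.
for s in `$HADOOP dfs -ls $crawl/segments \
  | sed -n 's|.*/\([0-9]\{14\}\).*|\1|p' | sort | sed '$d'`
do
  echo "removing old segment $s"
  $HADOOP dfs -rmr $crawl/segments/$s
done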