Hi,

On Mon, Sep 15, 2008 at 2:43 PM, Chetan Patel <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I have tried the recrawl script which is available on
> http://wiki.apache.org/nutch/IntranetRecrawl.
>
> I have got the following error:
>
> 2008-09-15 17:04:32,238 INFO fetcher.Fetcher - Fetcher: starting
> 2008-09-15 17:04:32,254 INFO fetcher.Fetcher - Fetcher: segment:
> google/segments/20080915170335
> 2008-09-15 17:04:32,972 FATAL fetcher.Fetcher - Fetcher:
> java.io.IOException: Segment already fetched!
>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:46)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:329)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> 2008-09-15 17:04:35,144 INFO crawl.CrawlDb - CrawlDb update: starting
>
> Please help me to solve this error.
>

The segment you are trying to crawl has already been fetched. Try
removing everything but crawl_generate under that segment.
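
If I remember correctly, the check in FetcherOutputFormat refuses to
run as soon as the segment already contains the output of an earlier
fetch (a crawl_fetch directory), so deleting that output and keeping
only crawl_generate lets the fetch run again. A rough sketch using the
segment path from your log, not taken from the wiki script (delete
whichever of these directories actually exist):

  # keep only crawl_generate under the segment; drop the old fetch/parse output
  # (use "hadoop dfs -rmr <path>" instead of rm -rf if the crawl lives on DFS)
  rm -rf google/segments/20080915170335/crawl_fetch
  rm -rf google/segments/20080915170335/content
  rm -rf google/segments/20080915170335/crawl_parse
  rm -rf google/segments/20080915170335/parse_data
  rm -rf google/segments/20080915170335/parse_text

After that, re-running the fetch step on the same segment should get
past the "Segment already fetched!" error.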

> Thanks in advance
>
> Regards,
> Chetan Patel
>
>
> Hilkiah Lavinier wrote:
>>
>> Hi,
>>
>> I think I've come across an issue with the way hadoop lists files. Or
>> maybe it's just me... anyway, I'm using a modified version of the crawl
>> script found on the wiki site. I'm trying to ensure that the fetch
>> operation always uses the latest segment generated by the generate
>> operation, thus the code looks like:
>>
>> echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>
>> echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays"
>> $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
>>
>> if [ $? -ne 0 ]
>> then
>>   echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>   break
>> fi
>>
>> # debug
>> ls -l --sort=t -r $crawl/segments
>> $HADOOP dfs -ls $crawl/segments
>>
>> segment=`$HADOOP dfs -ls $crawl/segments | tail -1 | sed -e 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
>> echo "** segment: $segment"
>>
>> echo "** $NUTCH fetch $segment -threads $threads"
>> $NUTCH fetch $segment -threads $threads
>>
>> However, every so often the crawl fails as per:
>>
>> --- Beginning crawl at depth 1 of 1 ---
>> ** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments -adddays 24
>> Generator: Selecting best-scoring urls due for fetch.
>> Generator: starting
>> Generator: segment: crawl/segments/20080417204644
>> Generator: filtering: true
>> Generator: jobtracker is 'local', generating exactly one partition.
>> Generator: Partitioning selected urls by host, for politeness.
>> Generator: done.
>> generate return value: 0
>> total 8
>> drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
>> drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
>> Found 2 items
>> /nutch/search/crawl/segments/20080417204644   <dir>   2008-04-17 20:46   rwxr-xr-x   hilkiah   hilkiah
>> /nutch/search/crawl/segments/20080417204628   <dir>   2008-04-17 20:46   rwxr-xr-x   hilkiah   hilkiah
>> ** segment: crawl/segments/20080417204628
>> ** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628 -threads 1000
>> Fetcher: starting
>> Fetcher: segment: crawl/segments/20080417204628
>> Fetcher: java.io.IOException: Segment already fetched!
>>         at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
>>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
>>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
>>
>> I think the problem is that hadoop doesn't return the latest segment
>> directory in chronological order (or alphanumeric order). First, is this
>> a known issue and, if so, how do I work around it? Secondly, since I
>> believe I refetch/reindex all pages (I only do depth=1 and run a series
>> of crawls at depth=1), can I safely delete the old segment/nnnnn folder
>> before generating a new one?
>>
>> Regards,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI
>> 6 Winston Lane,
>> Goodwill,
>> Roseau, Dominica
>>
>> Mbl: (767) 275 3382
>> Hm:  (767) 440 3924
>> Fax: (767) 440 4991
>> VoIP USA: (646) 432 4487
>>
>> Email: [EMAIL PROTECTED]
>> Email: [EMAIL PROTECTED]
>> IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
>> IM: ICQ #8978201 / AOL hilkiah21
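
Regarding the older question quoted above: as far as I know, hadoop
dfs -ls gives no guarantee about the order of its output, so piping it
through tail -1 can hand you an already-fetched segment, which is
exactly the "Segment already fetched!" failure shown. Since segment
names are just timestamps, sorting the names themselves is more
reliable than relying on listing order. A rough, untested variant of
the quoted snippet (it assumes the same $HADOOP and $crawl variables,
and that the item path is the first column of the dfs -ls output, as
in the listing above):

  # pick the newest segment by its timestamped name,
  # not by the position of the line in the dfs -ls output
  segment=$crawl/segments/`$HADOOP dfs -ls $crawl/segments | grep 'segments/' | awk '{print $1}' | sed -e 's#.*/##' | sort -n | tail -1`
  echo "** segment: $segment"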

--
Doğacan Güney
