Hi,
I think I've come across an issue with the way Hadoop lists files, or maybe it's just me. I'm using a modified version of the crawl script found on the wiki site, and I'm trying to ensure that the fetch operation always uses the latest segment produced by the generate operation. The relevant part of the script looks like this:
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays
$adddays"
$NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
#debug
ls -l --sort=t -r $crawl/segments
$HADOOP dfs -ls $crawl/segments
segment=`$HADOOP dfs -ls $crawl/segments | tail -1|sed -e
's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
echo "** segment: $segment"
echo "** $NUTCH fetch $segment -threads $threads"
$NUTCH fetch $segment -threads $threads
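For reference, the sed expression is meant to strip a DFS listing line down to the relative segment path, and tail -1 is then supposed to pick the newest segment. Just to illustrate what I expect it to do (using one of the lines from the listing further down):

echo '/nutch/search/crawl/segments/20080417204644 <dir> 2008-04-17 20:46 rwxr-xr-x hilkiah hilkiah' \
  | sed -e 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'
# prints: crawl/segments/20080417204644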
However, every so often the crawl fails like this:
--- Beginning crawl at depth 1 of 1 ---
** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments -adddays 24
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080417204644
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
generate return value: 0
total 8
drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
Found 2 items
/nutch/search/crawl/segments/20080417204644 <dir> 2008-04-17 20:46 rwxr-xr-x hilkiah hilkiah
/nutch/search/crawl/segments/20080417204628 <dir> 2008-04-17 20:46 rwxr-xr-x hilkiah hilkiah
** segment: crawl/segments/20080417204628
** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628 -threads 1000
Fetcher: starting
Fetcher: segment: crawl/segments/20080417204628
Fetcher: java.io.IOException: Segment already fetched!
at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)
I think the problem is that Hadoop doesn't return the segment directories in chronological (or alphanumeric) order, so tail -1 sometimes picks up an older segment instead of the one that was just generated. First, is this a known issue, and if so, how do I work around it? Secondly, since I refetch/reindex all pages anyway (I only crawl at depth=1 and run a series of depth=1 crawls), can I safely delete the old segments/nnnnn folders before generating a new one?
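Since the segment names are just timestamps (e.g. 20080417204644), I'm assuming lexicographic order is the same as chronological order, so one workaround I'm considering is sorting the extracted names myself before taking the last one. Something like this (untested sketch, using the same $HADOOP and $crawl variables as above):

# extract the relative segment paths, sort them (timestamp names should
# sort chronologically), and take the newest one
segment=`$HADOOP dfs -ls $crawl/segments \
  | sed -e 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/' \
  | grep "segments/" \
  | sort \
  | tail -1`
echo "** segment: $segment"

Does that look sane, or is there a better way to get the newest segment?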
Regards,
Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21