Hi,

I think I've come across an issue with the way hadoop lists files.  Or maybe 
it's just me... Anyway, I'm using a modified version of the crawl script found 
on the wiki site.  I'm trying to ensure that the fetch operation always uses 
the latest segment generated by the generate operation, so the code looks like:


 echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"

  echo "** $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays 
$adddays"
  $NUTCH generate $crawl/crawldb $crawl/segments $topN -adddays $adddays

  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi

  #debug
  ls -l --sort=t -r $crawl/segments
  $HADOOP dfs -ls $crawl/segments

  # take the last entry of the listing as the (hopefully newest) segment
  segment=`$HADOOP dfs -ls $crawl/segments | tail -1 | sed -e 's/\/.*\/.*\/\(.*\/.*\/[0-9]*\).*/\1/'`
  echo "** segment: $segment"

  echo "** $NUTCH fetch $segment -threads $threads"
  $NUTCH fetch $segment -threads $threads



However, every so often the crawl fails, as in:

--- Beginning crawl at depth 1 of 1 ---
** /nutch/nutch-1.0-dev/bin/nutch generate crawl/crawldb crawl/segments  -adddays 24
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080417204644
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
generate return value: 0
total 8
drwxr-xr-x 7 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204628
drwxr-xr-x 3 hilkiah hilkiah 4096 2008-04-17 20:46 20080417204644
Found 2 items
/nutch/search/crawl/segments/20080417204644     <dir>           2008-04-17 20:46        rwxr-xr-x       hilkiah hilkiah
/nutch/search/crawl/segments/20080417204628     <dir>           2008-04-17 20:46        rwxr-xr-x       hilkiah hilkiah
** segment: crawl/segments/20080417204628
** /nutch/nutch-1.0-dev/bin/nutch fetch crawl/segments/20080417204628 -threads 1000
Fetcher: starting
Fetcher: segment: crawl/segments/20080417204628
Fetcher: java.io.IOException: Segment already fetched!
        at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(FetcherOutputFormat.java:49)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:540)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:805)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:524)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:559)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:531)



I think the problem is that hadoop dfs -ls doesn't return the segment 
directories in chronological (or alphanumeric) order, so tail -1 isn't 
guaranteed to pick the segment that generate just created.  First, is this a 
known issue, and if so, how do I work around it?  Secondly, since I believe I 
refetch/reindex all pages anyway (I only do depth=1 and run a series of crawls 
at depth=1), can I safely delete the old segments/nnnnn folders before 
generating a new one?
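
For what it's worth, the workaround I had in mind (untested sketch; it assumes 
the last path component of each listing entry is the numeric timestamp shown 
above) is to sort the segment names myself rather than rely on the order 
hadoop dfs -ls happens to print them in:

  # untested: sort segment names numerically so the newest one always wins
  latest=`$HADOOP dfs -ls $crawl/segments | grep "$crawl/segments/" | awk '{print $1}' | sed 's!.*/!!' | sort -n | tail -1`
  segment=$crawl/segments/$latest
  echo "** segment: $segment"

Does that look reasonable, or is there a more reliable way to get the newest 
segment?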

Regards,
 
Hilkiah G. Lavinier MEng (Hons), ACGI 
6 Winston Lane, 
Goodwill, 
Roseau, Dominica

Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487


Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201  / AOL hilkiah21





      