Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Jeff Van Boxtel Wed, 19 Sep 2007 12:04:13 -0700

Hmm, yeah. In my case I merge the indexes into a temp dir, then I move
the existing index dir out of the way and move mine in there:
 
rm -rf crawl/MERGEDindexes
$NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
 
# in nutch-site, hadoop.tmp.dir points to crawl/tmp
rm -rf crawl/tmp/*
 
# we have to stop tomcat because sometimes it is still accessing the
# index file
sudo /etc/init.d/tomcat5.5 stop
 
# replace the live index with the merged one; keep the old one as OLDindexes
rm -rf crawl/OLDindexes
mv --verbose crawl/index crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/index


echo "----- Restarting Tomcat (Step 10 of $steps) -----"
sudo /etc/init.d/tomcat5.5 start
 
I also found I have to stop the Tomcat service; otherwise I can't delete
the index files. You may not need to do this if you aren't using
Tomcat.
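
As a rough sketch, the swap step above could be wrapped in a small
function so a missing merged index never clobbers the live one. This
assumes the same crawl/ layout as in my snippet (index, MERGEDindexes,
OLDindexes); swap_index is a made-up helper name, not a Nutch command,
and stopping/starting Tomcat is left to the caller.

```shell
#!/bin/sh
# Sketch only: swap_index is a hypothetical helper, not part of Nutch.
# It promotes <crawl_dir>/MERGEDindexes to <crawl_dir>/index and keeps
# the previous index as <crawl_dir>/OLDindexes.
swap_index() {
    crawl_dir=$1

    # Refuse to touch the live index unless a merged index is ready.
    if [ ! -d "$crawl_dir/MERGEDindexes" ]; then
        echo "swap_index: $crawl_dir/MERGEDindexes not found" >&2
        return 1
    fi

    # Drop the backup left by the previous run, then back up the live index.
    rm -rf "$crawl_dir/OLDindexes"
    if [ -d "$crawl_dir/index" ]; then
        mv "$crawl_dir/index" "$crawl_dir/OLDindexes" || return 1
    fi

    # Promote the merged index to be the live one.
    mv "$crawl_dir/MERGEDindexes" "$crawl_dir/index"
}
```

It would be called between the Tomcat stop and start lines, e.g.
"swap_index crawl", after the merge into crawl/MERGEDindexes has
finished.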
 
-Jeff

>>> [EMAIL PROTECTED] 9/19/2007 10:34 AM >>>

The recrawl script for 0.9 that I found at
http://wiki.apache.org/nutch/IntranetRecrawl is not working. It runs
successfully the first time. The second time, it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl/index already exists!
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
        at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

I am trying this with the latest version available in trunk. Please
help me to rectify this.
