Hmm, yeah. In my case I merge the indexes into a temp dir, then I move the existing index dir out of the way and move mine in there:

    rm -rf crawl/MERGEDindexes
    $NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

    # in nutch-site, hadoop.tmp.dir points to crawl/tmp
    rm -rf crawl/tmp/*

    # we have to stop tomcat because sometimes it is still accessing the index file
    sudo /etc/init.d/tomcat5.5 stop

    # replace indexes with indexes_merged
    rm -rf crawl/OLDindexes
    mv --verbose crawl/index crawl/OLDindexes
    mv --verbose crawl/MERGEDindexes crawl/index

    echo "----- Restarting Tomcat (Step 10 of $steps) -----"
    sudo /etc/init.d/tomcat5.5 start

I also found I have to stop the Tomcat service, otherwise I can't delete the index files. You may not need to do this if you aren't using Tomcat.

-Jeff

>>> [EMAIL PROTECTED] 9/19/2007 10:34 AM >>>
The recrawl script for 0.9 I found at http://wiki.apache.org/nutch/IntranetRecrawl is not working. It works successfully the first time. The second time, it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl/index already exists!
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
        at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

I am trying this with the latest version available in trunk. Please help me to rectify this.
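
P.S. For what it's worth, the swap part of the steps above could be wrapped in a small shell function so the script refuses to touch the live index when the merge output is missing. This is only a rough sketch, not part of the wiki recrawl script: the function name swap_index and its three arguments are made up for illustration, and stopping and starting Tomcat around the call is still up to the caller.

```shell
#!/bin/sh
# Hypothetical helper (the name and arguments are assumptions, not from Nutch):
# replace the live index directory with a freshly merged one, keeping the
# previous live index as a backup.
#   $1 = merged index dir (e.g. crawl/MERGEDindexes)
#   $2 = live index dir   (e.g. crawl/index)
#   $3 = backup dir       (e.g. crawl/OLDindexes)
swap_index() {
    merged=$1
    live=$2
    old=$3

    # Refuse to continue if the merge step produced nothing.
    if [ ! -d "$merged" ]; then
        echo "swap_index: merged index '$merged' not found" >&2
        return 1
    fi

    # Drop the previous backup, then move the current index aside.
    rm -rf "$old"
    if [ -d "$live" ]; then
        mv "$live" "$old" || return 1
    fi

    # Move the merged index into place.
    mv "$merged" "$live"
}

# Example call, with Tomcat stopped beforehand as in the script above:
#   swap_index crawl/MERGEDindexes crawl/index crawl/OLDindexes
```

Because the merge always writes to a directory that was removed just before, it never sees an existing crawl/index, which is what the FileAlreadyExistsException is complaining about.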
