Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Alexis Votta Wed, 19 Sep 2007 12:20:55 -0700

Hi Jeff...

Your block of code comes from the Nutch 0.9 crawl script, which is
described in a different article: http://wiki.apache.org/nutch/Crawl
I am facing the problem with the Nutch 0.9 recrawl script, which I
found in this article => http://wiki.apache.org/nutch/IntranetRecrawl


Even if I follow your approach, I lose the index of the previous crawl.
You are merging only the new indexes in this line:

$NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

I want to merge the new indexes with the old index, which is what the
Nutch 0.9 recrawl script tries to do, but it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory crawl/index already exists!
       at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
       at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
       at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
       at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

Has anyone used the re-crawl script successfully with trunk? Does it
work for nutch 0.9?
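
The exception above says that the merge output directory (crawl/index)
already exists. The sketch below works around that along the lines of the
quoted script further down: it merges into a directory that does not
exist yet, then swaps the result into place. The function name
swap_in_merged_index and the NUTCH_BIN variable are hypothetical and only
used for illustration; the directory names follow the thread. Note that
it merges only the new indexes, so it does not address combining them
with the old index.

```shell
#!/bin/sh
# Sketch only: swap_in_merged_index and NUTCH_BIN are hypothetical names
# introduced for this example, not part of Nutch.

swap_in_merged_index() {
    crawl_dir=$1        # e.g. crawl
    new_indexes=$2      # e.g. crawl/NEWindexes
    nutch=${NUTCH_BIN:-${NUTCH_HOME:-.}/bin/nutch}

    merged=$crawl_dir/MERGEDindexes
    backup=$crawl_dir/OLDindexes

    # The merge fails if its output directory already exists,
    # so clear any leftover from a previous run first.
    rm -rf "$merged"
    "$nutch" merge "$merged" "$new_indexes" || return 1

    # Touch the live index only after the merge has succeeded.
    rm -rf "$backup"
    if [ -d "$crawl_dir/index" ]; then
        mv "$crawl_dir/index" "$backup" || return 1
    fi
    mv "$merged" "$crawl_dir/index"
}
```

It could be called as: swap_in_merged_index crawl crawl/NEWindexes
If a servlet container such as Tomcat still has the index files open, it
may need to be stopped before the swap and restarted afterwards.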

On 9/20/07, Jeff Van Boxtel <[EMAIL PROTECTED]> wrote:
> hmm yea, in my case I merge the indexes into a temp dir, then I delete
> the existing index dir and move mine in there:
>
> rm -rf crawl/MERGEDindexes
> $NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes
>
> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
> rm -rf crawl/tmp/*
>
> # we have to stop tomcat because sometimes it is still accessing the
> # index file
> sudo /etc/init.d/tomcat5.5 stop
>
> # replace indexes with indexes_merged
> rm -rf crawl/OLDindexes
> mv --verbose crawl/index crawl/OLDindexes
> mv --verbose crawl/MERGEDindexes crawl/index
>
> echo "----- Restarting Tomcat (Step 10 of $steps) -----"
> sudo /etc/init.d/tomcat5.5 start
>
> I also found I have to stop the tomcat service otherwise I can't delete
> the index files. You may not need to do this if you aren't using
> Tomcat.
>
> -Jeff
>
> >>> [EMAIL PROTECTED] 9/19/2007 10:34 AM >>>
>
> The recrawl script for 0.9 that I found at
> http://wiki.apache.org/nutch/IntranetRecrawl is not working. It works
> successfully the first time. The second time, it fails with this error.
>
> merging indexes to: crawl/index
> IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException:
> Output directory crawl/index already exists!
>         at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
>         at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)
>
> I am trying this with the latest version available in trunk. Please
> help me to rectify this.
>
>
