[jira] Updated: (NUTCH-601) Recrawling on existing crawl directory using force option

Susam Pal (JIRA) Fri, 15 Feb 2008 12:58:31 -0800

     [ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Susam Pal updated NUTCH-601:
----------------------------

    Attachment: NUTCH-601v1.0.patch

Attached another patch (NUTCH-601v1.0.patch) that always deletes the old merged 
index, as Andrzej suggested.

The v0.4 patch would leave the old merged index in place alongside the new 
segments if something goes wrong while the new index is being generated. 
Whether the index merger fails or succeeds, we will always have an 'index' 
directory. So, after a recrawl completes, a user has to check whether the 
'index' directory holds the new merged index or the old one. This may be 
confusing.

However, one advantage is that one can run a recrawl on the same crawl 
directory that the web-gui is using to serve users. The v0.4 patch minimizes 
the time for which the index directory is unavailable.

The v1.0 patch always deletes the old indexes as well as the old merged index. 
Therefore, the old index never remains once index generation has begun. If the 
index merger fails, we won't have an 'index' directory, which is a clear 
indication that index generation failed. This prevents the confusion discussed 
above.
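
To make the trade-off concrete, here is a rough shell sketch of the two 
behaviours. It is not taken from either patch and does not use Nutch code: the 
function names, the 'index.new' temporary directory and the build command passed 
in as an argument are all hypothetical, and the swap shown for the v0.4-style 
variant is only one way to get the behaviour described above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Nutch or the patches.
# Usage: <function> <crawl_dir> <build command...>
# The build command is called with the target index path as its last argument.

# v1.0-style: delete the old merged index before building the new one.
# If the build fails, no 'index' directory is left behind.
replace_index_delete_first() {
  crawl_dir=$1; shift
  rm -rf "$crawl_dir/index"
  "$@" "$crawl_dir/index" || { rm -rf "$crawl_dir/index"; return 1; }
}

# v0.4-style: build elsewhere and replace the old index only on success.
# If the build fails, the old 'index' directory is still there.
replace_index_swap_on_success() {
  crawl_dir=$1; shift
  tmp="$crawl_dir/index.new"    # assumed temporary location
  rm -rf "$tmp"
  if "$@" "$tmp"; then
    rm -rf "$crawl_dir/index"
    mv "$tmp" "$crawl_dir/index"
  else
    rm -rf "$tmp"
    return 1
  fi
}
```

With the first function a missing 'index' directory signals a failed merge; with 
the second the directory is only unavailable for the moment between the final 
'rm' and 'mv', but its presence says nothing about whether the merge succeeded.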

Please review both patches and accept whichever the community feels is better.

> Recrawling on existing crawl directory using force option
> ---------------------------------------------------------
>
>                 Key: NUTCH-601
>                 URL: https://issues.apache.org/jira/browse/NUTCH-601
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.0.0
>            Reporter: Susam Pal
>            Priority: Minor
>         Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, 
> NUTCH-601v0.3.patch, NUTCH-601v1.0.patch
>
>
> Added a '-force' option to the 'bin/nutch crawl' command line. With this 
> option, one can crawl and recrawl in the following manner:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> This option can be used for the first crawl too:
> {code}
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
> {code}
> If one tries to crawl without the -force option when the crawl directory 
> already exists, the command fails with an error message that points to the 
> option:
> {code}
> # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
> Exception in thread "main" java.lang.RuntimeException: crawl already
> exists. Add -force option to recrawl.
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
> {code}
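
The check described in the quoted text can be sketched roughly as follows. This 
is a shell approximation for illustration, not the code in Crawl.java; the 
function name 'check_crawl_dir' and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration of the -force guard; the real check is in Crawl.java.
# Usage: check_crawl_dir <crawl_dir> [-force]
check_crawl_dir() {
  dir=$1
  force=$2
  # Refuse to reuse an existing crawl directory unless -force was given.
  if [ -e "$dir" ] && [ "$force" != "-force" ]; then
    echo "$dir already exists. Add -force option to recrawl." >&2
    return 1
  fi
  return 0
}
```

A first crawl passes the check with or without -force, because the directory 
does not exist yet; a recrawl passes only with -force.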

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
