[ 
https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515041
 ] 

Ian Holsman commented on NUTCH-524:
-----------------------------------

Hi Dogacan.

We need this setting because we have a single host with millions of 
URLs/files, and it is impossible for a single machine to crawl it in an 
adequate amount of time.

In this case, web politeness isn't an issue: we also own the site in 
question, and we know it can handle the load.

We thought that other large sites might also run into this issue, so we made 
it into a config option.

> Generate Problem with Single Node
> ---------------------------------
>
>                 Key: NUTCH-524
>                 URL: https://issues.apache.org/jira/browse/NUTCH-524
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.9.0
>         Environment: All
>            Reporter: Daniel Clark
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: nutch-0.9_PartitionUrlByHost.patch
>
>
> Nutch with Hadoop has problems with a single node in URL list when there is a 
> cluster of two or more machines.  I will provide a fix for this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers