[ 
https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217172#comment-13217172
 ] 

Mathijs Homminga commented on NUTCH-1289:
-----------------------------------------

Nice catch. PartitionUrlByHost does indeed seem broken.
I suggest we use the existing o.a.n.crawl.URLPartitioner class, which 
supports three URL partition modes (host, domain, IP) and is already used 
by the GeneratorJob.

Pros: the Fetcher gains support for the different partition modes, and we 
avoid duplicate code.
Or is there a reason why the Fetcher has its own partition logic?

The URLPartitioner class is a Partitioner<SelectorEntry, WebPage> rather than a 
Partitioner<IntWritable, FetchEntry>, but you could perhaps extract a method and 
use it from both classes, or create one URLPartitioner with two specific inner 
classes for the Generator and Fetcher.
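
To make the "extract a method" idea concrete, here is a rough sketch in plain 
Java. It is not the Nutch code: the class and method names 
(SharedUrlPartitioning, partitionByHost, GeneratorSide, FetcherSide) are 
hypothetical, the Hadoop Partitioner types are left out so the snippet runs 
on its own, and only the host mode is shown. Both wrappers take a URL string 
here; in the real classes each would first pull the URL out of its own 
key/value types.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical sketch of one shared partition routine; not part of Nutch. */
final class SharedUrlPartitioning {

    private SharedUrlPartitioning() {}

    /** Returns the lower-cased host, or the raw string if no host can be parsed. */
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? url : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return url; // fall back to the whole string as the partition key
        }
    }

    /** Shared logic: every URL of one host maps to the same partition. */
    static int partitionByHost(String url, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (hostOf(url).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Stand-in for the Generator-side partitioner (assumed shape). */
    static final class GeneratorSide {
        int getPartition(String url, int numPartitions) {
            return partitionByHost(url, numPartitions);
        }
    }

    /** Stand-in for the Fetcher-side partitioner (assumed shape). */
    static final class FetcherSide {
        int getPartition(String url, int numPartitions) {
            return partitionByHost(url, numPartitions);
        }
    }
}
```

Since both wrappers delegate to the same method, a URL gets the same 
partition number in the Generator and in the Fetcher, and all URLs of one 
host end up on one machine, which is what the politeness policy needs.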

                
> In distributed mode URL's are not partitioned
> ---------------------------------------------
>
>                 Key: NUTCH-1289
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1289
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: nutchgora
>            Reporter: Dan Rosher
>             Fix For: nutchgora
>
>         Attachments: NUTCH-1289.patch
>
>
> In distributed mode URL's are not partitioned to a specific machine which 
> means the politeness policy is voided


        
