[ https://issues.apache.org/jira/browse/NUTCH-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217172#comment-13217172 ]
Mathijs Homminga commented on NUTCH-1289:
-----------------------------------------
Nice catch. PartitionUrlByHost does indeed seem to be broken.
I would suggest that we use the existing o.a.n.crawl.URLPartitioner class, which
supports three URL partition modes (host, domain, IP) and is already used by
the GeneratorJob.
Pros: the Fetcher gains support for the different partition modes, and we avoid
duplicate code. Or is there a reason why the Fetcher has its own partition logic?
The URLPartitioner class is a Partitioner<SelectorEntry, WebPage> rather than a
Partitioner<IntWritable, FetchEntry>, but you could perhaps extract a method and
call it from both classes, or create one URLPartitioner with two specific inner
classes, one for the Generator and one for the Fetcher.
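
To illustrate the "extract a method" option, here is a minimal, self-contained sketch. It is not the actual Nutch code: the class name UrlPartitionSketch, the Mode enum, and the helpers partitionFor and naiveDomain are hypothetical, the Hadoop Partitioner types are left out so the snippet runs on its own, and the IP mode is omitted because it would need a DNS lookup. The domain extraction is deliberately naive (last two labels of the host), which is only an assumption for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch: one shared partition function that both a
// Generator-side and a Fetcher-side Partitioner could delegate to.
final class UrlPartitionSketch {

    // Only two of the three modes mentioned above are sketched here.
    enum Mode { HOST, DOMAIN }

    // Maps a URL to a partition in [0, numPartitions) so that all URLs
    // sharing the same host (or domain) land in the same partition.
    static int partitionFor(String url, Mode mode, int numPartitions) {
        String key = url; // fallback when the URL cannot be parsed
        try {
            String host = new URI(url).getHost();
            if (host != null) {
                host = host.toLowerCase(Locale.ROOT);
                key = (mode == Mode.DOMAIN) ? naiveDomain(host) : host;
            }
        } catch (URISyntaxException e) {
            // keep the raw URL string as the key
        }
        // Mask the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Naive assumption: the domain is the last two labels of the host.
    static String naiveDomain(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return n <= 2 ? host : parts[n - 2] + "." + parts[n - 1];
    }

    public static void main(String[] args) {
        int reducers = 4;
        System.out.println(partitionFor("http://example.com/a", Mode.HOST, reducers));
        System.out.println(partitionFor("http://example.com/b", Mode.HOST, reducers));
    }
}
```

Under this sketch, each concrete Partitioner would only pull the URL string out of its own key or value type and call partitionFor, so the host/domain logic lives in one place.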
> In distributed mode URLs are not partitioned
> --------------------------------------------
>
> Key: NUTCH-1289
> URL: https://issues.apache.org/jira/browse/NUTCH-1289
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: nutchgora
> Reporter: Dan Rosher
> Fix For: nutchgora
>
> Attachments: NUTCH-1289.patch
>
>
> In distributed mode URLs are not partitioned to a specific machine, which
> means the politeness policy is voided