[
https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrzej Bialecki updated NUTCH-570:
------------------------------------
Patch Info: [Patch Available]
> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
> Key: NUTCH-570
> URL: https://issues.apache.org/jira/browse/NUTCH-570
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Ned Rockson
> Priority: Minor
> Attachments: GeneratorDiff.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches
> (50-100M URLs at a time). I found that the generated URL order is not optimal
> because the URLs are simply randomized by a hash comparator. In one crawl on
> 24 machines it took about 3 days to crawl 30M URLs. Compared with old
> benchmarks I had set with the regular Fetcher.java, this took at least three
> times as long.
> Anyway, I realized that randomization only approximates a good ordering; to
> get the optimal ordering, URLs from the same host should be as far apart in
> the list as possible. So I wrote a series of two map/reduce jobs to optimize
> the ordering, and for a list of 25M documents it takes about 10 minutes on
> our cluster. Right now I have it in its own class, but I figure it could go
> in Generator.java, with a flag added to nutch-default.xml that determines
> whether the user wants to use it.
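
The ordering goal stated in the description (URLs from the same host as far
apart in the list as possible) can be illustrated with a small single-machine
sketch. This is not the attached GeneratorDiff.out patch and not the
reporter's two map/reduce jobs, whose details are not given here. The class
name HostSpread, the method spread, and the sample URLs are hypothetical. The
assumed technique: give the i-th of n URLs from one host the fractional
position (i + 0.5) / n, then sort all URLs by that position.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or of the attached patch.
final class HostSpread {

    /** A URL paired with its target fractional position in [0, 1). */
    private static final class Slot {
        final double position;
        final String url;

        Slot(double position, String url) {
            this.position = position;
            this.url = url;
        }
    }

    /**
     * Returns the URLs reordered so that URLs sharing a host are spread
     * evenly over the list. URI.create throws IllegalArgumentException
     * for malformed URLs; a real generator would need to handle that.
     */
    static List<String> spread(List<String> urls) {
        // Group URLs by host, keeping first-seen host order for determinism.
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = ""; // URLs without a host share one bucket
            }
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }

        // The i-th of n URLs from a host gets position (i + 0.5) / n, so
        // consecutive URLs of one host are 1/n of the list apart.
        List<Slot> slots = new ArrayList<>();
        for (List<String> group : byHost.values()) {
            int n = group.size();
            for (int i = 0; i < n; i++) {
                slots.add(new Slot((i + 0.5) / n, group.get(i)));
            }
        }

        // List.sort is stable, so ties keep their insertion order.
        slots.sort(Comparator.comparingDouble(s -> s.position));

        List<String> ordered = new ArrayList<>();
        for (Slot s : slots) {
            ordered.add(s.url);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "http://a.example/1", "http://a.example/2", "http://a.example/3",
                "http://b.example/1",
                "http://c.example/1", "http://c.example/2");
        for (String url : spread(input)) {
            System.out.println(url);
        }
    }
}
```

In this sketch, hosts with a single URL all land at position 0.5 and so
cluster in the middle of the list; a per-host random offset would even that
out. A distributed version would also have to compute per-host counts first,
which is one plausible reason for needing more than one map/reduce pass.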