[ https://issues.apache.org/jira/browse/NUTCH-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-570.
-------------------------------
Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
> Improvement of URL Ordering in Generator.java
> ---------------------------------------------
>
> Key: NUTCH-570
> URL: https://issues.apache.org/jira/browse/NUTCH-570
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Ned Rockson
> Assignee: Otis Gospodnetic
> Priority: Minor
> Attachments: GeneratorDiff.out, GeneratorDiff_v1.out
>
>
> [Copied directly from my email to nutch-dev list]
> Recently I switched from Fetcher to Fetcher2 for larger whole-web fetches
> (50-100M URLs at a time). I found that the generated URL order is not optimal
> because the URLs are simply randomized by a hash comparator. In one crawl on
> 24 machines it took about 3 days to crawl 30M URLs. Compared with old
> benchmarks I had set with the regular Fetcher.java, this took at least three
> times as long.
> Anyway, I realized that a good ordering can be approximated by
> randomization, but to get an optimal ordering, URLs from the same host
> should be as far apart in the list as possible. So I wrote a series of two
> map/reduce jobs to optimize the ordering; for a list of 25M documents it takes
> about 10 minutes on our cluster. Right now I have it in its own class, but I
> figured it could go in Generator.java, with a flag in nutch-default.xml
> that determines whether the user wants to use it.
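
The idea of keeping URLs from the same host far apart can be illustrated with a small single-process sketch. This is not the code from the attached patches and it does not use Hadoop; the function name `spread_by_host`, the `seed` parameter, and the fractional sort key are all assumptions made for illustration. The two passes loosely mirror the two jobs described above: the first groups URLs by host, and the second gives each URL a position key so that the URLs of one host land evenly across the final list.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def spread_by_host(urls, seed=0):
    """Order URLs so that URLs sharing a host are spaced roughly evenly.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    rng = random.Random(seed)

    # Pass 1: group URLs by host (a MapReduce job would key on the host).
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname or ""].append(url)

    # Pass 2: the i-th URL of a host with n URLs gets the sort key
    # (i + offset) / n, where offset is one random value in [0, 1) per host.
    # Keys of the same host are therefore 1/n apart across the whole list.
    keyed = []
    for group in by_host.values():
        rng.shuffle(group)
        offset = rng.random()
        n = len(group)
        for i, url in enumerate(group):
            keyed.append(((i + offset) / n, url))

    # Sorting by key interleaves the hosts.
    keyed.sort()
    return [url for _, url in keyed]
```

When every host contributes the same number of URLs, this ordering never places two URLs from the same host next to each other. With skewed host sizes the spacing is only approximate, and a host that dominates the list will still produce adjacent entries.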