[ https://issues.apache.org/jira/browse/NUTCH-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-396.
-------------------------------
Resolution: Won't Fix
> mergesegs sorts URLs, making segments useless for subsequent fetch
> ------------------------------------------------------------------
>
> Key: NUTCH-396
> URL: https://issues.apache.org/jira/browse/NUTCH-396
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 0.8
> Environment: Mac OS X 10.4.7
> Reporter: Doug Cook
> Priority: Minor
>
> Mergesegs leaves the output segment in URL-sorted order.
> This is a problem if the segment was just generated and not yet fetched - the
> fetcher likes the URLs to be in essentially random order (sort by URL hash or
> similar). If I fetch a segment created by mergesegs, my performance is
> extremely poor since all URLs from a given host will be grouped together and
> the per-host delays kill me.
> I have a local fix which I am using: map using a key of MD5(URL) + URL, then,
> during the reduce phase, chop the MD5 off the front to get the original URL.
> This is simple, yields an essentially random order, cannot suffer from key
> collisions (the URL itself is part of the key), and seems to work nicely.
> The only thing I don't know is whether or not there is some other tool
> expecting the sorted order (I would expect not, since generate does not
> produce this). Right now I have my fix as an option (-randomize), but if
> there is no other tool requiring sorted order, it's probably cleaner to just
> make this non-optional.
> Thoughts?
>
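The key scheme described in the report (MD5(URL) + URL on the map side, prefix stripped on the reduce side) can be illustrated with a minimal sketch. This is not Nutch code and not the reporter's patch: the function names are hypothetical, the hash is assumed to be hex-encoded, and a plain in-memory sort stands in for the MapReduce sort between the map and reduce phases.

```python
import hashlib

MD5_HEX_LEN = 32  # an MD5 digest is 32 characters when hex-encoded


def randomizing_key(url: str) -> str:
    """Map side (hypothetical helper): build the key MD5(URL) + URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + url


def original_url(key: str) -> str:
    """Reduce side (hypothetical helper): chop the MD5 prefix off the key."""
    return key[MD5_HEX_LEN:]


def shuffle_by_hash(urls):
    """Order URLs by their hash-prefixed keys, then recover the URLs.

    sorted() here is only a stand-in for the framework's sort step.
    """
    keys = sorted(randomizing_key(u) for u in urls)
    return [original_url(k) for k in keys]


if __name__ == "__main__":
    # Illustrative input only: URL-sorted, so hosts are grouped together.
    urls = [
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/3",
        "http://b.example/1",
        "http://b.example/2",
        "http://c.example/1",
    ]
    for url in shuffle_by_hash(urls):
        print(url)
```

Because the full URL is appended to the hash, two different URLs can never produce the same key, even if their MD5 values were to collide. The resulting order is deterministic for a given input, but it is unrelated to host names, so URLs from one host are no longer grouped together.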
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira