mergesegs sorts URLs, making segments useless for subsequent fetch
------------------------------------------------------------------

                 Key: NUTCH-396
                 URL: http://issues.apache.org/jira/browse/NUTCH-396
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.8
         Environment: Mac OS X 10.4.7
            Reporter: Doug Cook
            Priority: Minor


Mergesegs leaves the output segment in URL-sorted order.

This is a problem if the segment was just generated and not yet fetched: the 
fetcher works best when the URLs are in essentially random order (sorted by URL 
hash or similar). If I fetch a segment created by mergesegs, performance is 
extremely poor, because all URLs from a given host are grouped together and 
the per-host delays dominate the fetch time.

I have a local fix which I am using: map using a key of MD5(URL) + URL, then, 
during the reduce phase, chop the MD5 off the front to get the original URL. 
This is simple, yields essentially random order, has no problems with 
collisions (the full URL remains part of the key), and seems to work nicely.
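
A minimal sketch of the key scheme described above, in plain Java with no Hadoop or Nutch classes. This is not the actual patch: the class and method names are hypothetical, and hex-encoding the MD5 digest (giving a fixed 32-character prefix) is an assumption made so the prefix is easy to strip. Sorting the keys stands in for the sort that happens between map and reduce.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
public class RandomizedKeySketch {

    // An MD5 digest is 16 bytes, i.e. 32 hex characters (assumed encoding).
    private static final int MD5_HEX_LENGTH = 32;

    // "Map" side: build the sort key as hex(MD5(url)) + url.
    static String toRandomizedKey(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder key = new StringBuilder(MD5_HEX_LENGTH + url.length());
            for (byte b : digest) {
                key.append(String.format("%02x", b & 0xff));
            }
            return key.append(url).toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // "Reduce" side: chop the fixed-length MD5 prefix off to recover the URL.
    static String toOriginalUrl(String key) {
        return key.substring(MD5_HEX_LENGTH);
    }

    // Stand-in for the sort between map and reduce: order URLs by their keys.
    static List<String> orderByHash(List<String> urls) {
        List<String> keys = new ArrayList<>();
        for (String url : urls) {
            keys.add(toRandomizedKey(url));
        }
        Collections.sort(keys);
        List<String> ordered = new ArrayList<>();
        for (String key : keys) {
            ordered.add(toOriginalUrl(key));
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/1", "http://a.example/2",
                "http://b.example/1", "http://b.example/2");
        // Prints the same URLs, ordered by hash rather than by host.
        for (String url : orderByHash(urls)) {
            System.out.println(url);
        }
    }
}
```

Because the full URL is kept after the hash, two URLs whose MD5 values happened to collide would still produce distinct keys, so no records are merged by accident.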

The only thing I don't know is whether or not there is some other tool 
expecting the sorted order (I would expect not, since generate does not produce 
this). Right now I have my fix as an option (-randomize), but if there is no 
other tool requiring sorted order, it's probably cleaner to just make this 
non-optional.

Thoughts?

 


        