Nutch Developers,

Doesn't following a redirect invalidate the MapReduce URL partitioning scheme? For example, suppose that TaskTracker-A has been assigned all URLs on domain1.com and TaskTracker-B has been assigned all URLs on domain2.com. TaskTracker-A is now busy fetching pages from domain1.com. Suppose, however, that some of the URLs on domain2.com redirect to domain1.com. If TaskTracker-B follows those redirects immediately, both TaskTracker-A and TaskTracker-B could be hitting the same IP address at the same time.
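
To make the concern concrete, here is a minimal sketch in Java. It is not Nutch's actual partitioner: the class name, the method names, and the hash-the-host-name rule are all assumptions made for illustration. It only shows how a redirect can lead a task to a host that the partitioning assigned to a different task.

```java
import java.net.URI;

// Hypothetical illustration only; not the partitioner that Nutch ships.
class HostPartitionSketch {

    // Assumed rule: a URL belongs to the task chosen by hashing its host name.
    static int partitionFor(String url, int numTasks) {
        String host = URI.create(url).getHost();
        return Math.floorMod(host.hashCode(), numTasks);
    }

    // True when the redirect target belongs to a different task than the
    // URL that produced the redirect.
    static boolean crossesPartition(String fromUrl, String toUrl, int numTasks) {
        return partitionFor(fromUrl, numTasks) != partitionFor(toUrl, numTasks);
    }

    public static void main(String[] args) {
        String from = "http://domain2.com/old";
        String to = "http://domain1.com/new";
        System.out.println("owner of source: " + partitionFor(from, 2));
        System.out.println("owner of target: " + partitionFor(to, 2));
        System.out.println("redirect crosses partitions: " + crossesPartition(from, to, 2));
    }
}
```

A fetcher that follows the redirect in place whenever crossesPartition is true is the case where two tasks can end up talking to the same server.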

I'm not sure how significant such a problem might be, as it would probably be rare for such a redirection to result in simultaneous access to the same web server.

When we implemented our (IP-based) partitioning strategy in Nutch-0.7, we didn't follow redirects that pointed to a different IP address. Instead, we just made sure that the new URLs went into the web DB and were marked appropriately so that they would get fetched the very next time around our generate/fetch/update loop. We also copied the score from the record describing the old URL, though doing the same kind of thing in Nutch-0.8 would probably affect the performance of the OPIC algorithm for these cases.

Stefan Groschupf has also pointed out that we should try to honor http.redirect.max, even in this case, since this protects us from sites that redirect back and forth between two locations indefinitely.
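
A rough sketch of the decision described above, again in Java and again hypothetical: the names RedirectPolicySketch, Action, and decide are invented here, the IP addresses are assumed to have been resolved elsewhere, and the exact meaning of the limit (stop once the number of redirects already taken reaches it) is an assumption rather than the documented semantics of http.redirect.max. The redirect count would have to be carried along with the deferred URL's record for the limit to hold across generate/fetch/update cycles.

```java
// Hypothetical illustration only; not code from Nutch-0.7 or Nutch-0.8.
class RedirectPolicySketch {

    enum Action { FOLLOW_NOW, DEFER_TO_NEXT_CYCLE, GIVE_UP }

    // fromIp / toIp: addresses of the redirecting page and of its target,
    //                assumed to be resolved by the caller.
    // redirectsSoFar: redirects already taken on this chain, including any
    //                 carried over from earlier cycles.
    // maxRedirects:   the configured limit (assumed semantics, see above).
    static Action decide(String fromIp, String toIp, int redirectsSoFar, int maxRedirects) {
        if (redirectsSoFar >= maxRedirects) {
            return Action.GIVE_UP;             // breaks A -> B -> A loops
        }
        if (fromIp.equals(toIp)) {
            return Action.FOLLOW_NOW;          // same server, same task owns it
        }
        return Action.DEFER_TO_NEXT_CYCLE;     // record the new URL, fetch it next round
    }

    public static void main(String[] args) {
        System.out.println(decide("10.0.0.1", "10.0.0.1", 0, 3));
        System.out.println(decide("10.0.0.1", "10.0.0.2", 0, 3));
        System.out.println(decide("10.0.0.1", "10.0.0.2", 3, 3));
    }
}
```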

Thoughts?

- Chris

--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
