Nutch Developers,
Doesn't following a redirect invalidate the MapReduce URL
partitioning scheme? For example, suppose that TaskTracker-A has been
assigned all URLs on domain1.com and TaskTracker-B has been assigned
all URLs on domain2.com. TaskTracker-A is now busy fetching pages
from domain1.com. Suppose, however, that some of the URLs on
domain2.com redirect to domain1.com. If TaskTracker-B follows those
redirects immediately, both TaskTracker-A and TaskTracker-B could be
hitting the same IP address at the same time.
I'm not sure how significant such a problem might be, as it would
probably be rare for such a redirection to result in simultaneous
access to the same web server.
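
To make the scenario concrete, here is a minimal sketch in Python (not Nutch code). The names ASSIGNMENT, REDIRECTS, build_fetch_lists and hosts_contacted, and the example URLs, are all made up for illustration, and host names stand in for IP addresses. It only shows that a fetcher which follows redirects immediately can end up contacting a host that was assigned to another TaskTracker.

```python
from urllib.parse import urlparse

# Hypothetical host-to-tracker assignment mirroring the example above.
ASSIGNMENT = {"domain1.com": "TaskTracker-A", "domain2.com": "TaskTracker-B"}

# Hypothetical redirect: a page on domain2.com redirects to domain1.com.
REDIRECTS = {"http://domain2.com/old": "http://domain1.com/new"}


def build_fetch_lists(urls, assignment):
    """Group URLs into one fetch list per tracker, keyed by the URL's host."""
    lists = {}
    for url in urls:
        tracker = assignment[urlparse(url).hostname]
        lists.setdefault(tracker, []).append(url)
    return lists


def hosts_contacted(fetch_list, redirects):
    """Return the hosts a fetcher touches if it follows redirects at once."""
    hosts = set()
    for url in fetch_list:
        hosts.add(urlparse(url).hostname)
        target = redirects.get(url)
        if target is not None:
            hosts.add(urlparse(target).hostname)
    return hosts


urls = ["http://domain1.com/a", "http://domain2.com/old"]
fetch_lists = build_fetch_lists(urls, ASSIGNMENT)
contacted = {t: hosts_contacted(l, REDIRECTS) for t, l in fetch_lists.items()}

# Hosts that both trackers would contact during the same fetch run.
shared = contacted["TaskTracker-A"] & contacted["TaskTracker-B"]
print(shared)
```

Whether the two trackers actually overlap in time depends on scheduling, which this sketch does not model.
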
When we implemented our (IP-based) partitioning strategy in
Nutch-0.7, we didn't follow redirects that pointed to a different IP
address. Instead, we just made sure that the new URLs went into the
web DB and were marked appropriately so that they would get fetched
the very next time around our generate/fetch/update loop. We also
copied the score from the record describing the old URL, though doing
the same kind of thing in Nutch-0.8 would probably affect the
performance of the OPIC algorithm for these cases.
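
The approach described above could be sketched roughly as follows. This is an illustration in Python under stated assumptions, not our actual Nutch-0.7 code: Record, FAKE_DNS, lookup_ip and handle_redirect are hypothetical names, a plain dict stands in for the web DB, and a lookup table stands in for real DNS resolution.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Record:
    """Hypothetical stand-in for a web DB entry."""
    url: str
    score: float
    fetch_next_cycle: bool = False


# Hypothetical host-to-IP table used instead of a real DNS lookup.
FAKE_DNS = {"domain1.com": "10.0.0.1", "domain2.com": "10.0.0.2"}


def lookup_ip(url):
    """Resolve a URL's host through the fake table."""
    return FAKE_DNS[urlparse(url).hostname]


def handle_redirect(old, target_url, web_db, resolve=lookup_ip):
    """Follow same-IP redirects now; defer redirects to a different IP."""
    if resolve(old.url) == resolve(target_url):
        return "follow"
    # Different IP: record the target with the old URL's score and mark it
    # so the next generate/fetch/update cycle picks it up.
    web_db[target_url] = Record(target_url, old.score, fetch_next_cycle=True)
    return "defer"


web_db = {}
old = Record("http://domain2.com/old", score=0.5)
decision = handle_redirect(old, "http://domain1.com/new", web_db)
print(decision, web_db)
```

The sketch simply overwrites any existing record for the target URL; what to do when the target is already in the web DB is a separate question.
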
Stefan Groschupf has also pointed out that we should try to honor
http.redirect.max, even in this case, since this guards against sites
that redirect back and forth between two locations indefinitely.
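
As a rough illustration of why a redirect cap matters, here is a minimal Python sketch. The function follow_redirects and its max_redirects parameter are hypothetical; max_redirects only plays the role of http.redirect.max and does not claim to reproduce how Nutch interprets that property.

```python
def follow_redirects(start_url, redirects, max_redirects):
    """Follow a redirect chain; return None once the hop limit is exceeded."""
    url = start_url
    hops = 0
    while url in redirects:
        if hops >= max_redirects:
            return None  # give up: too many redirects
        url = redirects[url]
        hops += 1
    return url


# Two locations that redirect to each other indefinitely.
ping_pong = {"http://a.example/": "http://b.example/",
             "http://b.example/": "http://a.example/"}

# A normal single redirect.
single = {"http://a.example/": "http://b.example/"}

print(follow_redirects("http://a.example/", ping_pong, 3))
print(follow_redirects("http://a.example/", single, 3))
```

If redirects are deferred to later fetch cycles rather than followed at once, the hop count would have to be carried along with the URL record for the cap to have any effect.
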
Thoughts?
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers