[ http://issues.apache.org/jira/browse/NUTCH-268?page=comments#action_12383327 ]
Andrzej Bialecki commented on NUTCH-268:
-----------------------------------------

I forgot to add: if we change Generator to use IP addresses, then we should warn users that running a local caching DNS server becomes practically mandatory - otherwise Generator would be very slow, and it would also generate a lot of DNS traffic to external servers.

> Generator and lib-http use different definitions of "unique host"
> -----------------------------------------------------------------
>
>          Key: NUTCH-268
>          URL: http://issues.apache.org/jira/browse/NUTCH-268
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki
>     Assignee: Andrzej Bialecki
>      Fix For: 0.8-dev
>
> Generator uses the host name, as extracted from the URL, to determine the maximum
> number of URLs from a unique host (when generate.max.per.host is set > 0).
> This is supposed to prevent the situation where fetchlists become
> dominated by URLs coming from the same hosts, which in turn would clash with
> "politeness" rules.
> However, the http plugins (lib-http HttpBase.blockAddr) don't use the host name, and
> instead use its IP address (explicitly doing a DNS lookup on the host name
> extracted from the URL). This leads to the following undesirable behavior:
> * if a DNS name resolves to different IPs (round-robin balancing), then
> technically we are in violation of the "politeness" rules, because lib-http
> doesn't see this as a conflict and permits concurrent accesses to the same
> host name.
> * if different DNS names resolve to the same IP address (very common:
> CNAMEs, subdomains, web hosting, etc.) then the purpose of
> generate.max.per.host is defeated, because lib-http will block more
> frequently than intended, leading to excessive numbers of "Exceeded
> http.max.delays" exceptions.
> Proposed solution: synchronize Generator and lib-http in their interpretation
> of "unique host". Introduce a boolean property which instructs both Generator
> and lib-http to use either IP addresses or host names as "unique hosts",
> consistently in both places.
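
To make the proposal concrete, here is a rough sketch of what a shared "unique host" key could look like. It is not taken from the Nutch code base: the class UniqueHostKey, the byIp flag, and the limitPerHost helper are hypothetical names used only for illustration, and the resolver is passed in so that the in-process cache can be seen without real DNS traffic. The idea is that both Generator and lib-http would ask the same object for the key, so that they always agree on what counts as one host.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch: one definition of "unique host"
// that both the generator side and the fetcher side could share.
final class UniqueHostKey {
    private final boolean byIp;                       // assumed boolean property
    private final Function<String, String> resolver;  // host name -> IP, or null
    private final Map<String, String> cache = new HashMap<>();

    UniqueHostKey(boolean byIp, Function<String, String> resolver) {
        this.byIp = byIp;
        this.resolver = resolver;
    }

    // Real DNS lookup; returns null when the name cannot be resolved.
    static String dnsResolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // Returns the key under which a URL is counted: IP address or host name.
    String keyFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        if (!byIp) {
            return host;
        }
        // Successful lookups are cached; unresolved names fall back to the host name.
        String ip = cache.computeIfAbsent(host, resolver);
        return ip != null ? ip : host;
    }

    // Keeps at most maxPerHost URLs per key; a value <= 0 disables the limit.
    List<String> limitPerHost(List<String> urls, int maxPerHost) {
        Map<String, Integer> counts = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            int seen = counts.merge(keyFor(url), 1, Integer::sum);
            if (maxPerHost <= 0 || seen <= maxPerHost) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

With two names that resolve to the same address and a limit of 2, counting by host name keeps two URLs per name, while counting by IP keeps two in total, which is the mismatch described in the issue. For real lookups one would pass UniqueHostKey::dnsResolve as the resolver; the small in-process cache here does not replace a local caching DNS server.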
