[
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835056#comment-13835056
]
Walter Tietze commented on NUTCH-1360:
--------------------------------------
Hi Lewis,
as mentioned in the mail I sent you, I am attaching my patch for storing IP
addresses in apache-nutch-1.5.1.
(https://issues.apache.org/jira/browse/NUTCH-289 might also be appropriate!)
In our project MIA (http://mia-marktplatz.de/) we spider the German web. To
stay polite we had to switch to a 'byIP' policy to guarantee an interval of at
least one minute between requests to the same server. Crawling 'byHost' was not
an option, because many sites use up to several thousand subdomains hosted on a
single server with one IP address.
As our crawl proceeded I realized that crawling by IP seemed to slow down,
because while generating the URL lists Nutch has to resolve the IP address of
each URL in order to build the per-IP queues.
This patch is a simple solution: it writes the IP address, once determined,
into the metadata field of the CrawlDatum object. When a crawl cycle has
finished its fetch job, an additional map-reduce job is started to determine
the IP addresses of newly fetched and parsed URLs. New URLs are inserted into
the crawldb with their IP addresses if an IP address could be determined.
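To illustrate the idea of resolving once and reusing the stored value, here is
a minimal sketch. It is not the code from the attached patch: the class name,
the metadata key "_ip_" and the use of a plain Map in place of the CrawlDatum
metadata are all assumptions made for this example.

```java
import java.net.URI;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; a plain Map stands in for the CrawlDatum metadata. */
class IpMetadataSketch {
    // Assumed metadata key, chosen for this example only.
    static final String IP_KEY = "_ip_";

    /**
     * Returns the IP address to queue the URL by. A stored address is reused;
     * otherwise the host is resolved once and the result is written back.
     * Returns null if the URL has no host or the resolver finds no address.
     */
    static String ipFor(String url, Map<String, String> metadata,
                        Function<String, String> resolver) {
        String ip = metadata.get(IP_KEY);
        if (ip != null) {
            return ip; // already determined in an earlier cycle
        }
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        ip = resolver.apply(host);
        if (ip != null) {
            metadata.put(IP_KEY, ip); // store only successful lookups
        }
        return ip;
    }
}
```

With this shape the generate step only pays for a DNS lookup when the metadata
does not yet carry an address.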
The patch also contains two classes, IpAddressResolver.java and DNSCache.java,
which cache IP addresses already fetched from the DNS and control the number of
concurrent calls to the DNS from each map job. Since many URLs with the same IP
address end up in the same queue, I wanted to minimize the load of building the
queues. Caching IP addresses in memory shouldn't consume much memory. To avoid
too many concurrent requests to a DNS server from the crawler, I added some
code to restrict the number of parallel DNS requests.
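A rough sketch of how caching and a cap on parallel lookups can be combined is
shown below. It does not reproduce IpAddressResolver.java or DNSCache.java from
the patch; the class name DnsCacheSketch, its constructor parameters and the
pluggable lookup function are assumptions made for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical sketch; not the classes from the attached patch. */
class DnsCacheSketch {
    // host -> resolved address; an empty Optional records a failed lookup
    private final ConcurrentHashMap<String, Optional<String>> cache =
            new ConcurrentHashMap<>();
    private final Semaphore permits;
    private final Function<String, Optional<String>> lookup;

    DnsCacheSketch(int maxParallelLookups,
                   Function<String, Optional<String>> lookup) {
        this.permits = new Semaphore(maxParallelLookups);
        this.lookup = lookup;
    }

    /** Lookup backed by the JDK resolver; pass as DnsCacheSketch::jdkLookup. */
    static Optional<String> jdkLookup(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    Optional<String> resolve(String host) {
        Optional<String> cached = cache.get(host);
        if (cached != null) {
            return cached; // served from memory, no DNS traffic
        }
        permits.acquireUninterruptibly(); // cap concurrent DNS requests
        try {
            // Another thread may have resolved the host while we waited.
            cached = cache.get(host);
            if (cached != null) {
                return cached;
            }
            Optional<String> ip = lookup.apply(host);
            cache.put(host, ip);
            return ip;
        } finally {
            permits.release();
        }
    }
}
```

A caller might create one instance per map task, for example
new DnsCacheSketch(4, DnsCacheSketch::jdkLookup), where the limit of 4 is an
arbitrary value for this example. Failed lookups are cached as well so that
unresolvable hosts are not retried within the same task.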
I have been using this code in production for about three quarters of this year
and it seems to work fine. The four configuration entries should be
self-explanatory.
Cheers, Walter
> Support the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
> Issue Type: New Feature
> Components: protocol
> Affects Versions: nutchgora, 1.5
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1360-nutchgora-v2.patch,
> NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch, NUTCH-1360v3.patch,
> NUTCH-1360v4.patch, NUTCH-1360v5.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which
> we connect to to fetch a page.