[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

Walter Tietze (JIRA) Thu, 28 Nov 2013 11:58:38 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835056#comment-13835056
 ]


Walter Tietze commented on NUTCH-1360:
--------------------------------------

Hi Lewis,

as mentioned in the mail I sent you, I am attaching my patch for storing IP
addresses in apache-nutch-1.5.1.

(https://issues.apache.org/jira/browse/NUTCH-289 might also be appropriate!)

In our project MIA (http://mia-marktplatz.de/) we crawl the German web. To
stay polite we had to switch to a 'byIP' policy to guarantee an interval of
at least one minute between requests to the same server. Crawling 'byHost' was
not an option, because many sites use up to several thousand subdomains hosted
on a single server with one IP address.
As our crawl proceeded I noticed that crawling by IP seemed to slow down,
because while generating the URL lists Nutch has to resolve the IP address of
each URL in order to build the per-IP queues.

This patch is a simple solution: once an IP address has been determined, it is
written into the metadata field of the CrawlDatum object. When a crawl cycle has
finished its fetch job, an additional map-reduce job is started to determine the
IP addresses of newly fetched and parsed URLs. New URLs are inserted into the
crawldb together with their IP addresses, provided an IP address could be
determined.
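
To illustrate the idea of reusing an IP address once it has been stored, here
is a minimal sketch. It is not the code from the attached patch: a plain
java.util.Map stands in for the CrawlDatum metadata, and the key name "_ip_",
the class name and the resolver parameter are all assumptions made for this
example.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Sketch: reuse an IP stored in per-URL metadata, resolving only when it is missing. */
final class IpMetadataSketch {
    // Hypothetical metadata key; the patch may use a different name.
    static final String IP_KEY = "_ip_";

    /**
     * Returns the IP for a host, preferring the value already stored in metadata.
     * "metadata" stands in for the CrawlDatum metadata; "resolver" stands in for
     * a DNS lookup and returns Optional.empty() when resolution fails.
     */
    static Optional<String> ipFor(String host, Map<String, String> metadata,
                                  Function<String, Optional<String>> resolver) {
        String stored = metadata.get(IP_KEY);
        if (stored != null) {
            return Optional.of(stored); // already known, no DNS call needed
        }
        Optional<String> resolved = resolver.apply(host);
        // Store the address only if one could be determined.
        resolved.ifPresent(ip -> metadata.put(IP_KEY, ip));
        return resolved;
    }
}
```

With this shape, the generate step only pays for a DNS lookup the first time a
URL is seen; later cycles read the address straight from the stored metadata.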

The patch also contains two classes, IpAddressResolver.java and DNSCache.java,
which cache IP addresses already fetched from the DNS and control the number of
concurrent DNS calls from each map job. Since many URLs with the same IP address
are expected to end up in the same queue, I wanted to minimize the load caused
by building the queues. Caching IP addresses in memory should not consume much
memory. To avoid too many concurrent requests to a DNS server from the crawler,
I added some code to restrict the number of parallel DNS requests.
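
The combination of an in-memory cache and a limit on parallel DNS requests
could look roughly like the sketch below. This is an illustration only and not
the IpAddressResolver or DNSCache code from the patch: the class name, the
constructor parameters and the choice of a Semaphore are assumptions made for
this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Sketch: cache host-to-IP results and bound the number of parallel lookups. */
final class CachingResolverSketch {
    private final ConcurrentMap<String, Optional<String>> cache = new ConcurrentHashMap<>();
    private final Semaphore permits;
    private final Function<String, Optional<String>> lookup;

    /** maxConcurrentLookups is a hypothetical setting, not one of the patch's entries. */
    CachingResolverSketch(int maxConcurrentLookups,
                          Function<String, Optional<String>> lookup) {
        this.permits = new Semaphore(maxConcurrentLookups);
        this.lookup = lookup;
    }

    /** Lookup backed by the JVM resolver; empty when the host is unknown. */
    static Optional<String> systemLookup(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    Optional<String> resolve(String host) {
        Optional<String> cached = cache.get(host);
        if (cached != null) {
            return cached; // cache hit, no DNS traffic
        }
        permits.acquireUninterruptibly(); // blocks while too many lookups are in flight
        try {
            // Re-check: another thread may have filled the entry while we waited.
            cached = cache.get(host);
            if (cached != null) {
                return cached;
            }
            Optional<String> result = lookup.apply(host);
            cache.put(host, result); // failures are cached too, to avoid repeated misses
            return result;
        } finally {
            permits.release();
        }
    }
}
```

A resolver of this kind would be created once per map task, for example as
new CachingResolverSketch(10, CachingResolverSketch::systemLookup). Two threads
may still occasionally resolve the same host at the same time, but the total
number of lookups in flight never exceeds the configured limit.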

I have been using this code in production for about three-quarters of this year
and it seems to work fine. The four configuration entries should be
self-explanatory.

Cheers, Walter

 







> Suport the storing of IP address connected to when web crawling
> ---------------------------------------------------------------
>
>                 Key: NUTCH-1360
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1360
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1360-nutchgora-v2.patch, 
> NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch, NUTCH-1360v3.patch, 
> NUTCH-1360v4.patch, NUTCH-1360v5.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host which 
> we connect to to fetch a page.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
