[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
-----------------------------------

    Attachment: ipInCrawlDatumDraftV1.patch

To keep the discussion alive, I have attached a _first draft_ for storing the IP in the 
CrawlDatum for public discussion.

Some notes:
The IP is stored as a byte[] in the CrawlDatum itself, not in the metadata.
There is an IpAddressResolver MapRunnable tool that updates a crawlDb using 
multithreaded IP lookups.
If an IP is available in the CrawlDatum, the Generator uses the "cached" IP.
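
As a rough illustration of the multithreaded lookup idea (not the code from the attached patch), the sketch below resolves a list of host names on a fixed-size thread pool. The class name ParallelIpLookupSketch and the method resolveAll are made up for this example, and mapping a failed lookup to null is an assumption.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; this is NOT the IpAddressResolver from the patch.
final class ParallelIpLookupSketch {

    /** Resolves every host on a fixed-size pool; a failed lookup is stored as null (assumption). */
    static Map<String, byte[]> resolveAll(List<String> hosts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all lookups first so they run concurrently.
            List<Future<byte[]>> futures = new ArrayList<>();
            for (String host : hosts) {
                Callable<byte[]> lookup = () -> InetAddress.getByName(host).getAddress();
                futures.add(pool.submit(lookup));
            }
            // Collect results in input order.
            Map<String, byte[]> result = new LinkedHashMap<>();
            for (int i = 0; i < hosts.size(); i++) {
                try {
                    result.put(hosts.get(i), futures.get(i).get());
                } catch (ExecutionException e) {
                    result.put(hosts.get(i), null); // e.g. UnknownHostException
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    result.put(hosts.get(i), null);
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

The returned byte[] has 4 entries for an IPv4 address and 16 for an IPv6 address, which is the form the draft stores in the CrawlDatum.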

To discuss:
I don't like the idea of post-processing the complete crawlDb every time after an 
update. 
Processing the crawlDb is expensive in both storage usage and time. 
We could have a property "ipLookups" with the possible values 
<never|duringParsing|postUpdateDb>.
Then we could also add some code to look up the IP in the ParseOutputFormat, as 
discussed, or start the IpAddressResolver as a job from the updateDb tool code.
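
To make the proposed property more concrete, here is a minimal sketch of how the three values could be parsed. The enum IpLookupMode and the method fromConfig are hypothetical names, and treating a missing value as "never" is my assumption, not something decided in this issue.

```java
// Hypothetical enum for the proposed "ipLookups" property; nothing like this is in the patch yet.
enum IpLookupMode {
    NEVER("never"),
    DURING_PARSING("duringParsing"),
    POST_UPDATE_DB("postUpdateDb");

    private final String configValue;

    IpLookupMode(String configValue) {
        this.configValue = configValue;
    }

    /** Maps the raw property string to a mode; null falls back to NEVER (assumed default). */
    static IpLookupMode fromConfig(String raw) {
        if (raw == null) {
            return NEVER;
        }
        for (IpLookupMode mode : values()) {
            if (mode.configValue.equalsIgnoreCase(raw.trim())) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown ipLookups value: " + raw);
    }
}
```

The parsing and updateDb code could then branch on the returned mode to decide whether to resolve the IP inline, start a resolver job, or do nothing.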

At the moment I write the IP address bytes like this:
out.writeInt(ipAddress.length);
out.write(ipAddress); 
I think for now we can define that byte[] ipAddress is always 4 bytes long, 
or should we be IPv6-compatible already today?
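
One possible middle ground, sketched below under my own assumptions, is to keep a length prefix but accept only 4 (IPv4) or 16 (IPv6) bytes. Note that this variant uses a one-byte length instead of the int written in the draft, which saves three bytes per record; the class name IpFieldSketch and its methods are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper; differs from the draft patch by using a 1-byte length prefix.
final class IpFieldSketch {

    /** Writes the length (4 or 16) as one byte, followed by the raw address bytes. */
    static void writeIp(DataOutput out, byte[] ip) throws IOException {
        if (ip.length != 4 && ip.length != 16) {
            throw new IllegalArgumentException("IP must be 4 or 16 bytes, got " + ip.length);
        }
        out.writeByte(ip.length);
        out.write(ip);
    }

    /** Reads back what writeIp wrote and rejects any other length. */
    static byte[] readIp(DataInput in) throws IOException {
        int length = in.readUnsignedByte();
        if (length != 4 && length != 16) {
            throw new IOException("Unexpected IP length: " + length);
        }
        byte[] ip = new byte[length];
        in.readFully(ip);
        return ip;
    }

    /** Convenience for experiments: serialize to memory and read the value back. */
    static byte[] roundTrip(byte[] ip) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            writeIp(new DataOutputStream(buffer), ip);
            return readIp(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout an IPv4 entry costs 5 bytes and an IPv6 entry 17 bytes, so we would not have to change the format again once IPv6 hosts show up.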

Please give me some comments. I have a strong interest in getting this issue fixed 
asap and I'm willing to improve things as required. :-)

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting
>  Attachments: ipInCrawlDatumDraftV1.patch
>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.
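
The partitioning idea from the issue description could look roughly like the sketch below: every URL whose host has the same IP lands in the same fetch list. The class IpPartitionSketch and the method partitionFor are hypothetical names, and hashing the raw address bytes is just one possible choice.

```java
import java.util.Arrays;

// Hypothetical helper; not the partitioner that Nutch actually uses.
final class IpPartitionSketch {

    /** Returns a partition in [0, numPartitions) that depends only on the address bytes. */
    static int partitionFor(byte[] ip, int numPartitions) {
        // floorMod keeps the result non-negative even when the hash code is negative.
        return Math.floorMod(Arrays.hashCode(ip), numPartitions);
    }
}
```

Because the result depends only on the IP, two host names that point to the same server end up in the same fetch list, which is what per-IP politeness needs.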
