Re: Generate Normalizations, Resolving IP addresses, and Duplicates

Andrzej Bialecki Wed, 06 Feb 2008 15:46:55 -0800

Dennis Kubes wrote:

I have been working on improving the Generator for the last couple of days and here are the discussion areas I have come up with so far:
1) Would resolving IP addresses inside of the generator be useful? If we are limiting the number of urls to fetch then this would allow us to remove UnknownHosts beforehand, essentially giving us a better fetch list. The con is that it could as much as double the DNS load, as resolution would happen once during generate and once during fetching. As it is, I am working on a patch that gives the option to either resolve or not.

We had a discussion on this in the past. IIRC here are the issues with early IP resolution:


* hostname->ip mapping may change rapidly (round-robin DNS),

* resolving IPs in Generator practically enforces that even small installations must use local caching DNS servers - otherwise the cumulative DNS traffic created by Generator may be too high.

* fetcher would need a way to re-use this resolved IP so that we don't do the same lookup twice, i.e. we would have to implement a DNS provider that can use the resolved (and presumably saved) IPs during fetching.
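
To make the last point concrete, here is a minimal sketch of a cache that resolves each host once and hands the saved result to later callers. It is only an illustration under stated assumptions: ResolvedHostCache, Resolver, lookup and lookupCount are hypothetical names, not Nutch classes, and the cache lives in memory, whereas a real Generator/Fetcher pair would have to persist the IPs between jobs. It also does nothing about the round-robin DNS issue from the first bullet.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Nutch): each host is resolved at most once. */
class ResolvedHostCache {

    /** Pluggable lookup, so the cache can be exercised without touching the network. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private final Resolver resolver;
    // Optional.empty() marks a host that failed to resolve (negative caching).
    private final Map<String, Optional<InetAddress>> cache = new ConcurrentHashMap<>();
    private final AtomicInteger lookups = new AtomicInteger();

    ResolvedHostCache(Resolver resolver) {
        this.resolver = resolver;
    }

    /** Convenience factory that delegates to the JVM's own resolver. */
    static ResolvedHostCache usingSystemDns() {
        return new ResolvedHostCache(InetAddress::getByName);
    }

    /** Returns the saved address, resolving only on the first request for a host. */
    Optional<InetAddress> lookup(String host) {
        return cache.computeIfAbsent(host.toLowerCase(Locale.ROOT), h -> {
            lookups.incrementAndGet();
            try {
                return Optional.of(resolver.resolve(h));
            } catch (UnknownHostException e) {
                return Optional.<InetAddress>empty();
            }
        });
    }

    /** Number of times the underlying resolver was actually called. */
    int lookupCount() {
        return lookups.get();
    }

    public static void main(String[] args) {
        ResolvedHostCache cache = ResolvedHostCache.usingSystemDns();
        // An IP literal is parsed locally, so this demo performs no DNS query.
        System.out.println(cache.lookup("127.0.0.1").isPresent());
        System.out.println(cache.lookup("127.0.0.1").isPresent());
        System.out.println("resolver calls: " + cache.lookupCount());
    }
}
```

In this sketch the generate side would drop hosts whose lookup comes back empty, and the fetch side would call lookup again and get the saved answer without a second DNS query.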


All in all, it may be worth it, or perhaps it might not. ;)

2) Normalization of urls inside of generate. Currently in the reduce method of Selector inside of generate there is a normalization call. Personally I think this is in the wrong place. I think this should be in the Selector map method. As it is currently, the normalizer doesn't have any effect anyway because we are not collecting the changed url (a bug).
If we were to put the normalizer in the map method, then the possibility of duplicate urls from normalization arises as well. If I am not mistaken this would need another MR job at the end of generator to remove duplicates.
Another option would be just to NOT have the normalization option inside of Generator. Is there a good reason to have normalization in Generator?
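
The duplicate problem described above can be sketched in a few lines. This is a toy illustration, not the Generator code: UrlDedupSketch and its normalize rules (lowercase scheme and host, drop default ports and fragments, collapse dot segments) are assumptions standing in for whatever normalizers are configured, and the in-memory LinkedHashSet stands in for the extra MR job.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy example: distinct input urls can collapse to one url after normalization. */
class UrlDedupSketch {

    /** Hypothetical normalization rules; assumes absolute, well-formed urls. */
    static String normalize(String url) {
        URI u = URI.create(url).normalize();          // collapses "/a/../b" to "/b"
        if (u.getScheme() == null || u.getHost() == null) {
            return url;                               // leave anything unusual untouched
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        boolean defaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty())
                ? "/" : u.getRawPath();

        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        sb.append(path);
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();                         // the fragment is dropped on purpose
    }

    /** Normalizes every url, then removes the duplicates that normalization created. */
    static List<String> normalizeAndDedupe(List<String> urls) {
        Set<String> unique = new LinkedHashSet<>();   // keeps first-seen order
        for (String url : urls) {
            unique.add(normalize(url));
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "HTTP://Example.COM:80/a/../b#frag",
                "http://example.com/b",
                "http://example.com/c");
        System.out.println(normalizeAndDedupe(input));
    }
}
```

The first two inputs are different strings before normalization and identical after it, which is why a grouping step keyed on the normalized url would be needed if normalization moved into the map method.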

Well, I think the idea here was that the normalization rules could be very dynamic, i.e. they could change dramatically between subsequent runs of Generator - although I must say that I don't see this happening in practice ...

Perhaps Nutch should work on the following assumption: urls that are found in Crawldb are guaranteed to be normalized. If different normalization rules are needed then the crawldb needs to be explicitly filtered in a separate step, using the CrawlDbMerger tool.
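
A rough sketch of that assumption, using a toy in-memory store: ToyCrawlDb, inject, renormalize and the integer score are hypothetical names for illustration and do not reflect how CrawlDb or CrawlDbMerger are implemented. Urls are normalized once on the way in, readers use the keys as they are, and a rule change is applied as an explicit separate pass that merges entries which collide under the new rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy stand-in for a crawl database whose keys are always normalized. */
class ToyCrawlDb {
    // url -> score; the score is a made-up payload for this sketch.
    private final Map<String, Integer> entries = new LinkedHashMap<>();
    private UnaryOperator<String> normalizer;

    ToyCrawlDb(UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
    }

    /** Normalize on write, so every stored key satisfies the invariant. */
    void inject(String url, int score) {
        entries.merge(normalizer.apply(url), score, Integer::max);
    }

    /** Readers (the generator, in this sketch) take keys as-is, with no normalizer call. */
    Iterable<String> urls() {
        return entries.keySet();
    }

    /** Explicit separate step for a rule change: rewrite all keys, merging collisions. */
    void renormalize(UnaryOperator<String> newRules) {
        Map<String, Integer> rewritten = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.entrySet()) {
            rewritten.merge(newRules.apply(e.getKey()), e.getValue(), Integer::max);
        }
        entries.clear();
        entries.putAll(rewritten);
        normalizer = newRules;   // later injects follow the new rules
    }

    int size() {
        return entries.size();
    }
}
```

With this split, the generate step never needs a normalizer or a duplicate-removal job of its own; the cost moves to the occasional explicit rewrite.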



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
