Re: Generate Normalizations, Resolving IP addresses, and Duplicates

Dennis Kubes Wed, 06 Feb 2008 16:42:32 -0800


Andrzej Bialecki wrote:

Dennis Kubes wrote:
I have been working on improving the Generator for the last couple of days and here are the discussion areas I have come up with so far:
1) Would resolving IP addresses inside of the generator be useful? If we are limiting the number of urls to fetch then this would allow us to remove UnknownHosts beforehand, essentially giving us a better fetch list. The con is that it could as much as double the DNS load, as the lookup happens once during generate and once during fetching. As it is, I am working on a patch that gives the option to either resolve it or not.
We had a discussion on this in the past. IIRC here are the issues with early IP resolution:
* hostname->ip mapping may change rapidly (round-robin DNS),

True, but in this instance the goal was just to avoid pulling urls with unknown DNS into the fetchlist so we can get the maximum amount of good urls to fetch. We are trading time for quality.

* resolving IPs in Generator practically enforces that even small installations must use local caching DNS servers - otherwise the cumulative DNS traffic created by Generator may be too high.

If this is an option that is set to false by default then it should not affect current behavior, and it should give those of us with proper resources the ability to filter out unknown hosts if desired. I guess on one hand I am thinking that having the option is better than not having it, but maybe not. Do you think it would be bad to have the option?

* fetcher would need a way to re-use this resolved IP so that we don't do the same lookup twice, i.e. we would have to implement a DNS provider that can use the resolved (and presumably saved) IPs during fetching.

We could start with an option to have it there (not saved), then move it along to a more developed solution later :)

All in all, it may be worth it, or perhaps it might not. ;)
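
To make the trade-offs above concrete, here is a minimal sketch of an optional pre-fetch host check. It is not the patch mentioned in the thread and not Nutch code: the class `ResolvingUrlFilter`, its `resolveHosts` flag, and the pluggable `resolver` function are all hypothetical names chosen for illustration. The sketch assumes the option defaults to off, looks up each distinct host at most once, and keeps the resolved address in memory so a later stage could re-use it.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): optionally drops URLs whose host does not resolve. */
class ResolvingUrlFilter {
    private final boolean resolveHosts;                         // false = keep every URL, no lookups
    private final Function<String, Optional<String>> resolver;  // host -> IP, empty if unknown
    private final Map<String, Optional<String>> cache = new ConcurrentHashMap<>();

    ResolvingUrlFilter(boolean resolveHosts, Function<String, Optional<String>> resolver) {
        this.resolveHosts = resolveHosts;
        this.resolver = resolver;
    }

    /** Resolver backed by the JVM's standard lookup; returns empty for unknown hosts. */
    static Optional<String> systemLookup(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    /** Returns the URLs to keep. Each distinct host is passed to the resolver at most once. */
    List<String> filter(List<String> urls) {
        if (!resolveHosts) {
            return new ArrayList<>(urls);
        }
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null && cache.computeIfAbsent(host, resolver).isPresent()) {
                kept.add(url);
            }
        }
        return kept;
    }

    /** The address recorded for a host during filter(), if it resolved. */
    Optional<String> cachedIp(String host) {
        return cache.getOrDefault(host, Optional.empty());
    }

    /** Extracts the host, or null when the URL is malformed or has no host. */
    private static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With real DNS this would be constructed as `new ResolvingUrlFilter(true, ResolvingUrlFilter::systemLookup)`. The in-memory cache only addresses the duplicate-lookup concern within one process; carrying the addresses over to the fetcher would still need the saved-IP DNS provider described above.
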
2) Normalization of urls inside of generate. Currently in the reduce method of Selector inside of generate there is a normalization call. Personally I think this is in the wrong place. I think this should be in the Selector map method. As it is currently, the normalizer doesn't have any effect anyway because we are not collecting the changed url (a bug).
If we were to put the normalizer in the map method, then the possibility of duplicate urls from normalization arises as well. If I am not mistaken this would need another MR job at the end of generator to remove duplicates.
Another option would be just to NOT have the normalization option inside of Generator. Is there a good reason to have normalization in Generator?
Well, I think the idea here was that the normalization rules could be very dynamic, i.e. they could change dramatically between subsequent runs of Generator - although I must say that I don't see this happening in practice ...

Currently there is a normalization in the selector reduce. But that normalized url never updates the entry.url, which is a bug, so is it better to just remove the normalization or leave it? Here is the kicker though: if we leave it, it is possible that duplicate urls will make it through to the fetchlist.
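
A small illustration of why normalizing late can produce duplicates, and what the extra de-duplication pass amounts to. This is only an in-memory sketch: the class `NormalizeAndDedup` and its toy rules (lowercase scheme and host, collapse "." and ".." segments, drop the fragment, empty path becomes "/") are assumptions for the example, not Nutch's URLNormalizers, and a `LinkedHashSet` stands in for what would be a reduce keyed on the normalized url in a real MR job.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: normalize urls, then remove the duplicates normalization creates. */
class NormalizeAndDedup {

    /** Toy normalizer; returns null for urls that cannot be parsed or have no scheme/host. */
    static String normalize(String url) {
        URI u;
        try {
            u = URI.create(url).normalize();   // resolves "." and ".." path segments
        } catch (IllegalArgumentException e) {
            return null;
        }
        if (u.getScheme() == null || u.getHost() == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        String path = u.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();                  // fragment is intentionally dropped
    }

    /** Normalizes every url and keeps the first occurrence of each normalized form. */
    static List<String> normalizeAll(List<String> urls) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            String n = normalize(url);
            if (n != null) {
                seen.add(n);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

For example, `http://Example.com/a/./b/../c` and `HTTP://example.com/a/c#frag` are distinct keys before normalization but both become `http://example.com/a/c` afterwards, so without the final set (or reduce) both would reach the fetchlist.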

Perhaps Nutch should work on the following assumption: urls that are found in Crawldb are guaranteed to be normalized. If different normalization rules are needed then the crawldb needs to be explicitly filtered in a separate step, using CrawlDbMerger tool.

We have been talking about this lately. It seems like there need to be two more tools or extension points.

One is a url translator. Something that would say url A is actually url B. Anytime that url is seen it would be translated to its new form. This would help with cases such as java.net vs www.java.net being two separate urls. Jobs could be written to translate based on hash or other values.
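
A rough sketch of what such a translator could look like for the host-alias case. The class `HostTranslator` and its methods are hypothetical and not an existing Nutch extension point; the sketch assumes aliases are keyed on the host name only and leaves any url without a registered alias untouched.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical url translator: rewrites urls on an alias host to the canonical host. */
class HostTranslator {
    private final Map<String, String> aliases = new HashMap<>();

    /** Declares that urls on fromHost are really urls on toHost. */
    void addAlias(String fromHost, String toHost) {
        aliases.put(fromHost.toLowerCase(Locale.ROOT), toHost.toLowerCase(Locale.ROOT));
    }

    /** Returns the translated url, or the input unchanged when no alias applies. */
    String translate(String url) {
        URI u;
        try {
            u = URI.create(url);
        } catch (IllegalArgumentException e) {
            return url;                        // malformed urls are passed through
        }
        if (u.getScheme() == null || u.getHost() == null) {
            return url;
        }
        String target = aliases.get(u.getHost().toLowerCase(Locale.ROOT));
        if (target == null) {
            return url;
        }
        // Rebuild the url with only the host swapped; everything else is kept as written.
        StringBuilder sb = new StringBuilder(u.getScheme()).append("://");
        if (u.getRawUserInfo() != null) {
            sb.append(u.getRawUserInfo()).append('@');
        }
        sb.append(target);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }
}
```

After `addAlias("java.net", "www.java.net")`, `translate("http://java.net/projects?x=1")` yields `http://www.java.net/projects?x=1`, while urls already on www.java.net pass through as-is. Translation based on a content hash, as mentioned above, would need a different lookup key and is not covered here.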

The second tool would be a tool that more completely manipulates the crawldb (maybe this is the crawldbmerger). This would allow things like resetting the crawl dates on all urls in the crawldb or normalizing scores, i.e. global operations on the entire crawldb or on a subset of the urls in the crawldb.
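
The shape of such a tool can be sketched as "a selector plus an operation applied to every matching entry". Everything below is illustrative: `CrawlDbSweep` and `Datum` are hypothetical stand-ins (not Nutch's CrawlDatum or CrawlDbMerger), and an in-memory map replaces what would presumably be a job over the whole crawldb.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of a global crawldb operation over all or a subset of urls. */
class CrawlDbSweep {

    /** Minimal stand-in for a crawldb entry: just a fetch time and a score. */
    static final class Datum {
        final long fetchTime;
        final float score;

        Datum(long fetchTime, float score) {
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    /**
     * Returns a new map in which op has been applied to every entry whose url
     * matches selector; other entries are copied unchanged. The input is not modified.
     */
    static Map<String, Datum> apply(Map<String, Datum> db,
                                    Predicate<String> selector,
                                    UnaryOperator<Datum> op) {
        Map<String, Datum> out = new LinkedHashMap<>();
        for (Map.Entry<String, Datum> e : db.entrySet()) {
            Datum d = e.getValue();
            out.put(e.getKey(), selector.test(e.getKey()) ? op.apply(d) : d);
        }
        return out;
    }
}
```

Resetting the crawl date for one site would then be `apply(db, url -> url.contains("java.net"), d -> new CrawlDbSweep.Datum(0L, d.score))`, and passing `url -> true` as the selector covers the whole-crawldb case.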


Dennis
