Re: Generate Normalizations, Resolving IP addresses, and Duplicates

Andrzej Bialecki Thu, 07 Feb 2008 05:37:37 -0800

Dennis Kubes wrote:

* resolving IPs in Generator practically enforces that even small installations must use local caching DNS servers - otherwise the cumulative DNS traffic created by Generator may be too high.
If this is an option that is set to false by default then it shouldn't affect current behavior, and it should give those of us with proper resources the ability to filter out if desired. I guess on one hand I am thinking that having the option is better than not having it, maybe not. Do you think it would be bad to have the option?

Sure, we can do that, and set it to false by default - with proper caveats in the documentation. It's akin to the issue of generate.max.host.by.ip, where the code needs to resolve host names if this option is set - and in my experience if you turn this on, only a good two-level DNS cache can save you ;)

* fetcher would need a way to re-use this resolved IP so that we don't do the same lookup twice, i.e. we would have to implement a DNS provider that can use the resolved (and presumably saved) IPs during fetching.
We could start with an option to have it there (not saved), then move it along to a more developed solution later :)

Hmmm. IIRC it's not that difficult to write a DNS resolver provider, and some implementations already exist (dnsjava). We just need to make sure that the new code takes this into account, so that it would be able to use the DNS resolver _if_ one were available.
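
To make the idea concrete, here is a minimal sketch in Python (not Nutch code, and not the dnsjava API). The class name SavedIpResolver, its fields, and the fallback parameter are all hypothetical, introduced only for illustration: the resolver prefers IPs saved at generate time and performs a real lookup only when no saved IP exists.

```python
import socket
from typing import Callable, Dict, Optional


class SavedIpResolver:
    """Resolve host names, preferring IPs saved by an earlier step.

    Hypothetical illustration only; none of these names exist in Nutch.
    """

    def __init__(self, saved: Dict[str, str],
                 fallback: Optional[Callable[[str], str]] = socket.gethostbyname):
        self.saved = dict(saved)   # host -> IP recorded at generate time
        self.fallback = fallback   # real DNS lookup, or None to disable it
        self.lookups = 0           # number of real lookups that were needed

    def resolve(self, host: str) -> str:
        ip = self.saved.get(host)
        if ip is not None:
            return ip              # reuse the saved IP, no DNS traffic
        if self.fallback is None:
            raise LookupError(f"no saved IP for {host} and no fallback resolver")
        self.lookups += 1
        ip = self.fallback(host)
        self.saved[host] = ip      # remember it so the lookup is not repeated
        return ip
```

With a fallback of None the fetcher would rely purely on saved IPs; with the default it degrades to an ordinary lookup for hosts the Generator never resolved.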

Well, I think the idea here was that the normalization rules could be very dynamic, i.e. they could change dramatically between subsequent runs of Generator - although I must say that I don't see this happening in practice ...
Currently there is a normalization in the selector reduce. But that normalized url never updates the entry.url, which is a bug, so is it better to just remove the normalization or leave it? Here is the kicker though: if we leave it, it is possible that duplicate urls will make it through to the fetchlist.

That is assuming that normalization rules have changed since the last updatedb - because urls that are being added to crawldb are already passed through the normalizers, so among all urls in the crawldb there should be no duplicates.
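
A toy Python sketch of that scenario, under stated assumptions: the crawldb was written under older rules, and the normalizer below (strip a trailing slash) stands in for a newer rule set. The function names and the (url, score) record shape are hypothetical and do not mirror the actual Generator code.

```python
from typing import Callable, List, Optional, Tuple

Entry = Tuple[str, float]  # (url, score); a made-up record shape


def strip_trailing_slash(url: str) -> Optional[str]:
    """Stand-in for a 'new' normalization rule; None would mean 'reject'."""
    return url.rstrip("/")


def select_keep_original(entries: List[Entry],
                         normalize: Callable[[str], Optional[str]]) -> List[str]:
    """Normalizes only as a check, then emits the untouched url."""
    return [url for url, _ in entries if normalize(url) is not None]


def select_dedup(entries: List[Entry],
                 normalize: Callable[[str], Optional[str]]) -> List[str]:
    """Emits the normalized url, one entry per normalized form."""
    best = {}
    for url, score in entries:
        norm = normalize(url)
        if norm is None:
            continue
        if norm not in best or score > best[norm]:
            best[norm] = score  # keep the highest score per normalized url
    return sorted(best)


crawldb = [("http://example.org/a", 1.0), ("http://example.org/a/", 2.0)]
print(select_keep_original(crawldb, strip_trailing_slash))  # both urls survive
print(select_dedup(crawldb, strip_trailing_slash))          # one url survives
```

If the rules never change between updatedb and generate, both functions return the same set, which is why the duplicates only appear in the changed-rules case.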

Perhaps Nutch should work on the following assumption: urls that are found in Crawldb are guaranteed to be normalized. If different normalization rules are needed then the crawldb needs to be explicitly filtered in a separate step, using CrawlDbMerger tool.
We have been talking about this lately. It seems like there need to be two more tools or extension points.
One is a url translator. Something that would say url A is actually url B. Anytime that url is seen it would be translated to its new form. This would help with cases such as java.net vs www.java.net being two separate urls. Jobs could be written to translate based on hash or other values.

I agree - I think an extension point would be ideal here, because we need to determine the "aliases" in many places during the regular crawling cycle. This is closely related to the issue of handling redirects and to the static scoring.
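
As a rough illustration of what such a translator could look like, here is a small Python sketch. UrlTranslator and its host_aliases table are hypothetical names, not an existing Nutch extension point; the sketch only rewrites the host part and, for brevity, drops any user:password component.

```python
from typing import Dict
from urllib.parse import urlsplit, urlunsplit


class UrlTranslator:
    """Rewrite urls whose host is a known alias of another host."""

    def __init__(self, host_aliases: Dict[str, str]):
        # alias host -> canonical host, compared case-insensitively
        self.host_aliases = {k.lower(): v for k, v in host_aliases.items()}

    def translate(self, url: str) -> str:
        parts = urlsplit(url)
        host = parts.hostname or ""      # urlsplit already lowercases this
        target = self.host_aliases.get(host)
        if target is None:
            return url                   # not an alias, leave untouched
        netloc = target if parts.port is None else f"{target}:{parts.port}"
        return urlunsplit((parts.scheme, netloc, parts.path,
                           parts.query, parts.fragment))


translator = UrlTranslator({"java.net": "www.java.net"})
print(translator.translate("http://java.net/projects?x=1"))
```

The alias table here is hand-written; a job that detects aliases (for example by content hash, as suggested above) would be what fills it in.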

The second tool would be a tool that more completely manipulates the crawldb (maybe this is the crawldbmerger). This would allow things like resetting the crawl dates on all urls for the crawldb, normalizing scores, things like that that are global operations on the entire crawldb or a subset of the urls in the crawldb.

Yes, a sort of CrawlDbAdmin tool ... we could use the CrawlDbMerger as one of the components. This would be a welcome contribution if someone were to write this tool ;)

The DB management in Nutch is still somewhat inconvenient (perhaps this will change when / if we start using HBase ;) ).

In the meantime, I have this wild idea of a tool, that consists of a skeleton map-reduce job, tailored to consume and produce crawldb / linkdb, and a set of Beanshell scripts that plug in to map() and reduce() and perform basic admin tasks like filtering, normalization, setting values, adding / removing metadata etc, etc. This way users could relatively easily extend the administrative functions just by writing scripts.
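
The shape of such a tool can be sketched in a few lines of Python. This is an in-memory toy, not Hadoop and not Beanshell: plain Python functions stand in for the scripts, and the record fields (score, fetch_time) and all function names are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Datum = Dict[str, float]
Record = Tuple[str, Datum]  # (url, datum); a made-up crawldb record shape


def run_job(records: Iterable[Record],
            map_fn: Callable[[str, Datum], Iterator[Record]],
            reduce_fn: Callable[[str, List[Datum]], Iterator[Record]]) -> List[Record]:
    """Skeleton job: the user supplies only map_fn and reduce_fn."""
    groups = defaultdict(list)
    for url, datum in records:
        for out_url, out_datum in map_fn(url, datum):
            groups[out_url].append(out_datum)   # group map output by url
    out = []
    for url in sorted(groups):
        out.extend(reduce_fn(url, groups[url]))
    return out


# Two tiny "admin scripts" plugged into the skeleton.
def reset_fetch_time(url: str, datum: Datum) -> Iterator[Record]:
    yield url, {**datum, "fetch_time": 0}       # setting a value on every url


def keep_highest_score(url: str, data: List[Datum]) -> Iterator[Record]:
    yield url, max(data, key=lambda d: d["score"])


crawldb = [
    ("http://a/", {"score": 1.0, "fetch_time": 100}),
    ("http://a/", {"score": 2.0, "fetch_time": 200}),
    ("http://b/", {"score": 0.5, "fetch_time": 300}),
]
print(run_job(crawldb, reset_fetch_time, keep_highest_score))
```

Filtering fits the same skeleton: a map function that yields nothing for a url simply drops it from the output.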


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
