Dennis Kubes wrote:

* resolving IPs in Generator practically enforces that even small installations must use local caching DNS servers - otherwise the cumulative DNS traffic created by Generator may be too high.

If this is an option that is set to false by default then it shouldn't affect current behavior, and it should give those of us with proper resources the ability to filter out if desired. I guess on one hand I am thinking that having the option is better than not having it, but maybe not. Do you think it would be bad to have the option?


Sure, we can do that, and set it to false by default - with proper caveats in the documentation. It's akin to the issue of generate.max.per.host.by.ip, where the code needs to resolve host names if this option is set - and in my experience if you turn this on, only a good two-level DNS cache can save you ;)




* fetcher would need a way to re-use this resolved IP so that we don't do the same lookup twice, i.e. we would have to implement a DNS provider that can use the resolved (and presumably saved) IPs during fetching.

We could start with an option to have it there (not saved), then move it along to a more developed solution later :)

Hmmm. IIRC it's not that difficult to write a DNS resolver provider, and some implementations already exist (dnsjava). We just need to make sure that the new code takes this into account, so that it would be able to use the DNS resolver _if_ one were available.
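
To make the "re-use the resolved IP" part concrete, here is a rough sketch of what I mean. All names in it (CachingResolver, put(), resolve()) are made up for illustration and are not existing Nutch or dnsjava APIs; the lookup function is pluggable so that a real resolver could be dropped in, and the default simply falls back to the JDK resolver.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper (not part of Nutch): re-uses IPs that were already resolved. */
class CachingResolver {
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> lookup;

  /** The lookup is pluggable, so a different DNS implementation could be swapped in. */
  CachingResolver(Function<String, String> lookup) {
    this.lookup = lookup;
  }

  /** Default variant that falls back to the JDK resolver. */
  static CachingResolver usingJdk() {
    return new CachingResolver(host -> {
      try {
        return InetAddress.getByName(host).getHostAddress();
      } catch (UnknownHostException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  /** Record an IP that was resolved elsewhere, e.g. saved during generation. */
  void put(String host, String ip) {
    cache.put(host.toLowerCase(Locale.ROOT), ip);
  }

  /** Return the cached IP if there is one, otherwise look it up once and remember it. */
  String resolve(String host) {
    String key = host.toLowerCase(Locale.ROOT);
    String ip = cache.get(key);
    if (ip == null) {
      ip = lookup.apply(key);
      cache.put(key, ip);
    }
    return ip;
  }
}
```

The fetcher side would then call resolve() and only hit the network for hosts that were not seen before. How the saved IPs get from Generator to the fetcher is left open here.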



Well, I think the idea here was that the normalization rules could be very dynamic, i.e. they could change dramatically between subsequent runs of Generator - although I must say that I don't see this happening in practice ...

Currently there is a normalization in the selector reduce. But that normalized url never updates the entry.url, which is a bug. So is it better to just remove the normalization, or leave it? Here is the kicker though: if we leave it, it is possible that duplicate urls will make it through to the fetchlist.

That is assuming that normalization rules have changed since the last updatedb - because urls that are being added to crawldb are already passed through the normalizers, so among all urls in the crawldb there should be no duplicates.


Perhaps Nutch should work on the following assumption: urls that are found in Crawldb are guaranteed to be normalized. If different normalization rules are needed then the crawldb needs to be explicitly filtered in a separate step, using CrawlDbMerger tool.

We have been talking about this lately. It seems like there needs to be two more tools or extension points.

One is a url translator. Something that would say url A is actually url B. Anytime that url is seen it would be translated to its new form. This would help with cases such as java.net vs www.java.net being two separate urls. Jobs could be written to translate based on hash or other values.

I agree - I think an extension point would be ideal here, because we need to determine the "aliases" in many places during the regular crawling cycle. This is closely related to the issue of handling redirects and to the static scoring.
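
Just to illustrate what such an extension point might look like, here is a minimal sketch. The UrlTranslator interface and the HostAliasTranslator class are hypothetical names, not anything that exists in Nutch today, and this toy version only rewrites host names from a static alias table; a real plugin could decide aliases based on content hashes or other values.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical extension point: says "url A is actually url B". */
interface UrlTranslator {
  String translate(String url);
}

/** Toy implementation that only rewrites the host, using a static alias table. */
class HostAliasTranslator implements UrlTranslator {
  private final Map<String, String> hostAliases = new HashMap<>();

  void addAlias(String fromHost, String toHost) {
    hostAliases.put(fromHost.toLowerCase(Locale.ROOT), toHost);
  }

  @Override
  public String translate(String url) {
    try {
      URI u = new URI(url);
      String host = u.getHost();
      if (host == null) {
        return url; // nothing to translate
      }
      String target = hostAliases.get(host.toLowerCase(Locale.ROOT));
      if (target == null) {
        return url; // no alias known for this host
      }
      // Rebuild the url with the aliased host, keeping all other parts.
      return new URI(u.getScheme(), u.getUserInfo(), target, u.getPort(),
          u.getPath(), u.getQuery(), u.getFragment()).toString();
    } catch (URISyntaxException e) {
      return url; // leave unparseable urls untouched
    }
  }
}
```

With an alias added from java.net to www.java.net, both forms would end up under the same key wherever the translator is applied.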


The second tool would be a tool that more completely manipulates the crawldb (maybe this is the crawldbmerger). This would allow things like resetting the crawl dates on all urls in the crawldb or normalizing scores, i.e. global operations on the entire crawldb or on a subset of its urls.

Yes, a sort of CrawlDbAdmin tool ... we could use the CrawlDbMerger as one of the components. This would be a welcome contribution if someone were to write this tool ;)

The DB management in Nutch is still somewhat inconvenient (perhaps this will change when / if we start using HBase ;) ).

In the meantime, I have this wild idea of a tool that consists of a skeleton map-reduce job, tailored to consume and produce crawldb / linkdb, and a set of Beanshell scripts that plug in to map() and reduce() and perform basic admin tasks like filtering, normalization, setting values, adding / removing metadata, etc. This way users could relatively easily extend the administrative functions just by writing scripts.
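
A very rough, in-memory sketch of the shape I have in mind, with plain Java lambdas standing in for the Beanshell scripts and a list standing in for the crawldb. Everything here is hypothetical: CrawlEntry is a simplified stand-in and not Nutch's real record type, and there is no Hadoop job behind it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

/** Simplified stand-in for a crawldb record (hypothetical, not the Nutch type). */
class CrawlEntry {
  final String url;
  final float score;
  final long fetchTime;

  CrawlEntry(String url, float score, long fetchTime) {
    this.url = url;
    this.score = score;
    this.fetchTime = fetchTime;
  }
}

/** Skeleton job: the two "scripts" are the only parts a user would write. */
class CrawlDbAdminSketch {
  /**
   * mapScript may rewrite an entry or drop it (empty Optional);
   * reduceScript merges entries that end up under the same url.
   */
  static Map<String, CrawlEntry> run(
      List<CrawlEntry> db,
      Function<CrawlEntry, Optional<CrawlEntry>> mapScript,
      BinaryOperator<CrawlEntry> reduceScript) {
    Map<String, CrawlEntry> out = new TreeMap<>();
    for (CrawlEntry e : db) {
      mapScript.apply(e).ifPresent(m -> out.merge(m.url, m, reduceScript));
    }
    return out;
  }
}
```

For example, a map script could drop non-http urls, lowercase the rest and reset fetchTime to 0, while the reduce script keeps the higher-scored entry when two urls collapse into one.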

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
