Dennis Kubes wrote:
* resolving IPs in Generator practically enforces that even small
installations must use local caching DNS servers - otherwise the
cumulative DNS traffic created by Generator may be too high.
If this is an option that is set to false by default then it shouldn't
affect current behavior, and it should give those of us with proper
resources the ability to filter out if desired. I guess on one hand I
am thinking that having the option is better than not having it, maybe
not. Do you think it would be bad to have the option?
Sure, we can do that, and set it to false by default - with proper
caveats in the documentation. It's akin to the issue of
generate.max.host.by.ip, where the code needs to resolve host names if
this option is set - and in my experience if you turn this on, only a
good two-level DNS cache can save you ;)
* fetcher would need a way to re-use this resolved IP so that we don't
do the same lookup twice, i.e. we would have to implement a DNS
provider that can use the resolved (and presumably saved) IPs during
fetching.
We could start with an option to have it there (not saved), then move it
along to a more developed solution later :)
Hmmm. IIRC it's not that difficult to write a DNS resolver provider, and
some implementations already exist (dnsjava). We just need to make sure
that the new code takes this into account, so that it would be able to
use the DNS resolver _if_ one were available.
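
To make the "resolve once, reuse during fetching" idea concrete, here is a
minimal sketch. ResolvedIpCache and its methods are hypothetical names, not
existing Nutch or dnsjava classes, and the resolver is passed in as a plain
function so that any DNS provider could be plugged in behind it.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper, not an existing Nutch class.
class ResolvedIpCache {
    private final Map<String, String> ipByHost = new ConcurrentHashMap<>();
    private final Function<String, String> resolver;
    private final AtomicInteger lookups = new AtomicInteger();

    // The resolver is an assumption: any host -> IP function can be used.
    ResolvedIpCache(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    /** Record an IP that was already resolved elsewhere, e.g. during generation. */
    void put(String host, String ip) {
        ipByHost.put(key(host), ip);
    }

    /** Return the saved IP if present; otherwise ask the resolver once and remember it. */
    String resolve(String host) {
        return ipByHost.computeIfAbsent(key(host), h -> {
            lookups.incrementAndGet();
            return resolver.apply(h);
        });
    }

    /** Number of times the underlying resolver was actually called. */
    int lookupCount() {
        return lookups.get();
    }

    // Host names are case-insensitive, so use a lower-cased key.
    private static String key(String host) {
        return host.toLowerCase(Locale.ROOT);
    }
}
```

In a real setup the function would wrap an actual lookup (for example
java.net.InetAddress.getByName, with its UnknownHostException handled), and
the saved IPs would have to travel with the fetchlist; both are left out here.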
Well, I think the idea here was that the normalization rules could be
very dynamic, i.e. they could change dramatically between subsequent
runs of Generator - although I must say that I don't see this
happening in practice ...
Currently there is a normalization in the selector reduce. But that
normalized url never updates the entry.url, which is a bug, so is it
better to just remove the normalization or leave it? Here is the kicker
though: if we leave it, it is possible that duplicate urls will make it
through to the fetchlist.
That is assuming that normalization rules have changed since the last
updatedb - because urls that are being added to crawldb are already
passed through the normalizers, so among all urls in the crawldb there
should be no duplicates.
Perhaps Nutch should work on the following assumption: urls that are
found in Crawldb are guaranteed to be normalized. If different
normalization rules are needed then the crawldb needs to be explicitly
filtered in a separate step, using CrawlDbMerger tool.
We have been talking about this lately. It seems like there need to be
two more tools or extension points.
One is a url translator. Something that would say url A is actually url
B. Anytime that url is seen it would be translated to its new form.
This would help with cases such as java.net vs www.java.net being two
separate urls. Jobs could be written to translate based on hash or
other values.
I agree - I think an extension point would be ideal here, because we
need to determine the "aliases" in many places during the regular
crawling cycle. This is closely related to the issue of handling
redirects and to the static scoring.
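
A rough sketch of what such a translator extension point might look like.
UrlTranslator and HostAliasTranslator are hypothetical names (Nutch defines
no such interface), and treating www.java.net as the canonical form is only
an assumption for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical extension point: "url A is actually url B".
interface UrlTranslator {
    String translate(String url);
}

// One possible implementation: rewrite known alias hosts to a canonical host.
class HostAliasTranslator implements UrlTranslator {
    private final Map<String, String> canonicalByAlias = new HashMap<>();

    void addAlias(String aliasHost, String canonicalHost) {
        canonicalByAlias.put(aliasHost.toLowerCase(Locale.ROOT), canonicalHost);
    }

    @Override
    public String translate(String url) {
        try {
            URI u = new URI(url);
            String host = u.getHost();
            if (u.getScheme() == null || host == null) {
                return url; // relative or opaque url: leave untouched
            }
            String canonical = canonicalByAlias.get(host.toLowerCase(Locale.ROOT));
            if (canonical == null) {
                return url; // no alias known for this host
            }
            // Rebuild from the raw components so existing escapes are preserved.
            StringBuilder sb = new StringBuilder(u.getScheme()).append("://");
            if (u.getRawUserInfo() != null) sb.append(u.getRawUserInfo()).append('@');
            sb.append(canonical);
            if (u.getPort() != -1) sb.append(':').append(u.getPort());
            if (u.getRawPath() != null) sb.append(u.getRawPath());
            if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
            if (u.getRawFragment() != null) sb.append('#').append(u.getRawFragment());
            return sb.toString();
        } catch (URISyntaxException e) {
            return url; // unparseable urls pass through unchanged
        }
    }
}
```

The alias table itself could be produced by a separate job (e.g. by comparing
content hashes, as suggested above); this sketch only covers the lookup side.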
The second tool would be a tool that more completely manipulates the
crawldb (maybe this is the crawldbmerger). This would allow things like
resetting the crawl dates on all urls for the crawldb, normalizing
scores, and similar global operations on the entire
crawldb or a subset of the urls in the crawldb.
Yes, a sort of CrawlDbAdmin tool ... we could use the CrawlDbMerger as
one of the components. This would be a welcome contribution if someone
were to write this tool ;)
The DB management in Nutch is still somewhat inconvenient (perhaps this
will change when / if we start using HBase ;) ).
In the meantime, I have this wild idea of a tool that consists of a
skeleton map-reduce job, tailored to consume and produce crawldb /
linkdb, and a set of Beanshell scripts that plug in to map() and
reduce() and perform basic admin tasks like filtering, normalization,
setting values, adding / removing metadata etc, etc. This way users
could relatively easily extend the administrative functions just by
writing scripts.
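
To illustrate the shape of that idea, here is a very rough sketch. It is not
a map-reduce job and does not use Beanshell: an in-memory map stands in for
the crawldb, and a Java lambda stands in for the script. All names here
(CrawlDbAdminSketch, Datum, AdminScript) are made up for the example, and
keeping the higher-scored record when two urls collapse to the same key is
just one possible reduce policy.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.TreeMap;

class CrawlDbAdminSketch {
    /** Simplified stand-in for a crawldb record (real Nutch uses CrawlDatum). */
    static final class Datum {
        final float score;
        final long fetchTime;

        Datum(float score, long fetchTime) {
            this.score = score;
            this.fetchTime = fetchTime;
        }
    }

    /** The pluggable "script": return a (url, datum) pair to keep, or null to drop. */
    interface AdminScript {
        Map.Entry<String, Datum> apply(String url, Datum datum);
    }

    /** Convenience for scripts that need to emit a pair. */
    static Map.Entry<String, Datum> entry(String url, Datum datum) {
        return new AbstractMap.SimpleEntry<>(url, datum);
    }

    /** Skeleton: apply the script to every record ("map"), then merge by url ("reduce"). */
    static Map<String, Datum> run(Map<String, Datum> db, AdminScript script) {
        Map<String, Datum> out = new TreeMap<>();
        for (Map.Entry<String, Datum> rec : db.entrySet()) {
            Map.Entry<String, Datum> result = script.apply(rec.getKey(), rec.getValue());
            if (result != null) {
                // If two urls end up with the same key, keep the higher score.
                out.merge(result.getKey(), result.getValue(),
                        (a, b) -> a.score >= b.score ? a : b);
            }
        }
        return out;
    }
}
```

A script that filters out non-http urls, lower-cases the rest and resets the
fetch time would then be a one-liner passed to run(); the same skeleton would
cover score normalization or metadata changes.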
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com