Hi,

On Thu, Jan 29, 2015 at 12:49 AM, <[email protected]> wrote:

>
> Ok, these are valid use cases. They have in common that
> the Nutch user owns the crawled servers or is (hopefully)
> explicitly allowed to perform the security research.
>

Another example would be a backend storage migration of all crawl data for
one or more domains. I've done migrations for clients before, and being able
to override robots.txt in order to get this done in a timely fashion has been
mutually beneficial. So you are absolutely right here Seb :)


>
>
> What about an option (or config file) to exclude explicitly
> a list of hosts (or IPs) from robots.txt parsing?
>

Like a whitelist. Say I know the IP(s) or hosts I want to override
robots.txt for (these can be easily obtained by turning on the
store.ip.address property); I could then write those hosts and IPs to a flat
file, and robots.txt would be overridden for those entries only. Is this what
you are suggesting?
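
Something like the rough sketch below is what I have in mind. The class name
and the file format are purely hypothetical, nothing that exists in Nutch
today, just to illustrate a flat file with one host or IP per line:

// Hypothetical sketch: load a flat file of hosts/IPs (one entry per line,
// '#' for comments) and check whether a given host or stored IP is
// whitelisted for a robots.txt override.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class RobotsOverrideWhitelist {

  private final Set<String> entries = new HashSet<String>();

  public RobotsOverrideWhitelist(String whitelistFile) throws IOException {
    for (String line : Files.readAllLines(Paths.get(whitelistFile),
        StandardCharsets.UTF_8)) {
      line = line.trim();
      if (!line.isEmpty() && !line.startsWith("#")) {
        entries.add(line.toLowerCase());
      }
    }
  }

  // True if either the host name or the stored IP is listed.
  public boolean isOverridden(String host, String ip) {
    return entries.contains(host.toLowerCase())
        || (ip != null && entries.contains(ip));
  }
}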


> That would require more effort to configure than a boolean property
> but because it's explicit, it prevents users from disabling
> robots.txt in general and also guarantees that
> the security research is not accidentally "extended"


And possibly this would be activated by a boolean property, e.g.
use.robots.override.whitelist? In all honesty Sebb, I think this sounds
like a better compromise because, as you said, it is explicit. It is still
pretty easy to configure, right enough. All you need to do is run
parsechecker, for example, log the IP, add it to the new configuration
file, then override. It seems good.
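
Roughly, I picture the check being gated like this (again just a sketch;
apart from use.robots.override.whitelist the names are made up, and the
whitelist class is the hypothetical one from above):

// Hypothetical sketch: robots rules are only skipped when the boolean
// property is explicitly enabled AND the host/IP is on the whitelist,
// so the default behaviour (honour robots.txt) stays untouched.
import org.apache.hadoop.conf.Configuration;

public class RobotsOverrideGate {

  public static boolean skipRobotsRules(Configuration conf,
      RobotsOverrideWhitelist whitelist, String host, String ip) {
    // Disabled by default, so users have to opt in explicitly.
    boolean enabled = conf.getBoolean("use.robots.override.whitelist", false);
    if (!enabled || whitelist == null) {
      return false;
    }
    return whitelist.isOverridden(host, ip);
  }
}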
It actually reminds me of one of the very first patches I tried to take
on... which is still open, OMFG:
https://issues.apache.org/jira/browse/NUTCH-208
I need to sort that patch out and commit it... 4 years is a terribly long
time to have left it hanging!
