Hey Markus,

On Nov 24, 2011, at 8:58 AM, Markus Jelsma wrote:

> Hi devs,
> 
> I stumbled upon the following user agent string:
> 
> TestNutch/Nutch-1.2 (testing Nutch; http://nutch.apache.org; 
> [email protected])
> 
> It says it's from Apache but it seems is not, since ASF does not operate a 
> Nutch spider last time i checked.

It doesn't, yet. But as soon as I figure out how to get Nutch running
for this stinkin' site I'm trying to crawl, I am going to head over to infra@
and propose setting up a Nutch crawler. I'd also like to solidify some of the
REST control URL stuff that Andrzej started to work on and get that going
a little bit more, but I think it would be very useful to crawl, e.g., all the
ASF sites.

> Should we allow this?

Do you mean should we block the use of Apache by some trickery 
in code?

If that's the suggestion, I am not sure I'd be in favor of it. I'd rather
make it easy to find out where the spider came from and help
in the identification of the agent, before restricting what users can
put in that area.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
