Hey Markus,

On Nov 24, 2011, at 8:58 AM, Markus Jelsma wrote:

> Hi devs,
>
> I stumbled upon the following user agent string:
>
> TestNutch/Nutch-1.2 (testing Nutch; http://nutch.apache.org;
> [email protected])
>
> It says it's from Apache but it seems is not, since ASF does not operate a
> Nutch spider last time i checked.

It doesn't, yet. But as soon as I figure out how to get Nutch running for
this stinkin' site I'm trying to crawl, I am going to head over to infra@
and propose setting up a Nutch crawler. I'd also like to solidify some of the
REST control URL stuff that Andrzej started to work on and get that going a
little bit more, but I think it would be very useful to crawl, e.g., all the
ASF sites.

> Should we allow this?

Do you mean should we block the use of "Apache" by some trickery in code? If
that's the suggestion, I am not sure I'd be in favor of it. I'd rather make
it easy to find out where the spider came from and help in the identification
of the agent, before restricting what users can put in that area.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

