On Thursday 24 November 2011 18:05:30 Mattmann, Chris A (388J) wrote:
> Hey Markus,
>
> On Nov 24, 2011, at 8:58 AM, Markus Jelsma wrote:
> > Hi devs,
> >
> > I stumbled upon the following user agent string:
> >
> > TestNutch/Nutch-1.2 (testing Nutch; http://nutch.apache.org;
> > [email protected])
> >
> > It says it's from Apache but it seems it is not, since ASF does not
> > operate a Nutch spider, last time I checked.
>
> It doesn't, yet. But, as soon as I figure out how to get Nutch running
> for this stinkin' site I'm trying to crawl I am going to head over to
> infra@ and propose to set up a Nutch crawler. I'd also like to solidify
> some of the REST control URL stuff that Andrzej started to work on and
> get that going a little bit more, but I think it would be very useful
> to crawl, e.g., all the ASF sites.

We do that too. Take care, it's a _lot_ of data and it changes quickly. :D

> > Should we allow this?
>
> Do you mean should we block the use of Apache by some trickery
> in code?
>
> If that's the suggestion, I am not sure I'd be in favor of it. I'd
> rather make it easy to find out where the spider came from and help
> in the identification of the agent, before restricting what users can
> put in that area.

Yes. Hard-code a check that prevents the use of Apache in one of the
strings. It's easy to take out, but I guess a lot of users never actually
compile the thing. Well, I think restriction is indeed not very proper
within ASF.

Thanks

> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Markus Jelsma - CTO - Openindex

