On Thursday 24 November 2011 18:05:30 Mattmann, Chris A (388J) wrote:
> Hey Markus,
> 
> On Nov 24, 2011, at 8:58 AM, Markus Jelsma wrote:
> > Hi devs,
> > 
> > I stumbled upon the following user agent string:
> > 
> > TestNutch/Nutch-1.2 (testing Nutch; http://nutch.apache.org;
> > [email protected])
> > 
> > It claims to be from Apache, but it seems it is not, since the ASF did
> > not operate a Nutch spider last time I checked.
> 
> It doesn't, yet. But, as soon as I figure out how to get Nutch running
> for this stinkin' site I'm trying to crawl I am going to head over to
> infra@ and propose to set up a Nutch crawler. I'd also like to solidify
> some of the REST control URL stuff that Andrzej started to work on and get
> that going a little bit more, but I think it would be very useful to
> crawl, e.g., all the ASF sites.

We do that too. Take care, it's a _lot_ of data and it changes quickly. :D

> 
> > Should we allow this?
> 
> Do you mean we should block the use of "Apache" by some trickery
> in code?
> 
> If that's the suggestion, I am not sure I'd be in favor of it. I'd rather
> make it easy to find out where the spider came from and help
> in the identification of the agent, before restricting what users can
> put in that area.

Yes: hard-code a check that prevents the use of "Apache" in one of the agent
strings. It's easy to take out, but I guess a lot of users never actually
compile the thing. Well, I think such a restriction is indeed not very proper
within the ASF.
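
To make the idea concrete, here is a rough sketch of what such a check might look like. It is not Nutch code: the function name, the field names, and the choice of a plain case-insensitive substring match are all assumptions for illustration, and the sample values are taken from the user agent string quoted above.

```python
# Hypothetical sketch; none of these names come from Nutch itself.
RESERVED_TOKEN = "apache"  # assumed token to guard against


def fields_using_reserved_token(agent_fields):
    """Return the names of agent fields whose value mentions the reserved token."""
    return [
        name
        for name, value in agent_fields.items()
        if RESERVED_TOKEN in value.lower()  # case-insensitive substring match
    ]


# Values taken from the user agent string quoted earlier in the thread.
fields = {
    "name": "TestNutch",
    "description": "testing Nutch",
    "url": "http://nutch.apache.org",
}

offending = fields_using_reserved_token(fields)
print(offending)  # ['url']
```

A substring match this simple would also flag any legitimate link to an apache.org page, which is part of why a hard restriction is questionable.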

Thanks

> 
> Cheers,
> Chris
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-- 
Markus Jelsma - CTO - Openindex
