This looks like the embedded-params URL bug, which was fixed several versions
ago. Even so, Nutch should not crawl anything in that directory regardless of
that bug. Perhaps the bot operator disabled the robots.txt check.
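
To illustrate both points, here is a rough sketch in Python (not Nutch's own
Java code). It assumes robots.txt disallows /stuff_i_got_in_emails/ for all
agents; the real file is not quoted in this thread, so ASSUMED_ROBOTS_TXT is
hypothetical, as are the helper names resolve_link and is_allowed. It shows
that resolving a relative link against the directory listing URL should drop
the "?C=M;O=D" part rather than append ";O=D", and that a crawler honouring
such a rule would skip these URLs either way.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Assumption: the actual robots.txt is not shown in the thread.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /stuff_i_got_in_emails/
"""

# Directory listing URL the crawler entered through (from the report).
BASE = "http://alex.dinorcia.net/stuff_i_got_in_emails/?C=M;O=D"
# Agent name as it appears in the access log.
AGENT = "HD nutch agent"


def resolve_link(base: str, href: str) -> str:
    """Resolve a relative link; the base URL's query string is not carried over."""
    return urljoin(base, href)


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = resolve_link(BASE, "LeafBlower.jpg")
    print(url)
    print(is_allowed(ASSUMED_ROBOTS_TXT, AGENT, url))
```

Under that assumed rule, the well-formed URL and the broken ";O=D" variant
are both off limits, so the 404s in the log should never have been requested.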

Cheers 
 
-----Original message-----
> From:Lewis John Mcgibbney <[email protected]>
> Sent: Fri 26-Oct-2012 15:42
> To: [email protected]
> Subject: Re: misbehaving crawler
> 
> What exactly is the issue here?
> 
> Lewis
> 
> On Thu, Oct 25, 2012 at 4:59 PM, Alex diNorcia <[email protected]> wrote:
> > http://alex.dinorcia.net/robots.txt has been in place and unchanged since
> > Aug 24  2004
> >
> > * i'd also point out that it's crawling poorly to boot. the original link it
> > got into the directory with was
> > http://alex.dinorcia.net/stuff_i_got_in_emails/?C=M;O=D
> > it appears to add the descending order part of the get variables to each
> > file and gets a 404 error.
> >
> > here are some of the 14516 log entries that are not obeying the rules :
> > 119.139.27.64 - - [25/Oct/2012:04:22:08 -0400] "GET
> > /stuff_i_got_in_emails/Japanese%20Engrish%204.jpg;O=D HTTP/1.0" 404 246 "-"
> > "HD nutch agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:05:20:50 -0400] "GET
> > /stuff_i_got_in_emails/LeafBlower.jpg;O=D HTTP/1.0" 404 238 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:06:26:43 -0400] "GET
> > /stuff_i_got_in_emails/snowmen3.gif;O=D HTTP/1.0" 404 236 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:07:01:49 -0400] "GET
> > /stuff_i_got_in_emails/Everything.About.The.Doctor.jpg;O=D HTTP/1.0" 404 255
> > "-" "HD nutch agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:08:12:06 -0400] "GET
> > /stuff_i_got_in_emails/fucked.jpg;O=D HTTP/1.0" 404 234 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:08:18:54 -0400] "GET
> > /stuff_i_got_in_emails/H28.gif;O=D HTTP/1.0" 404 231 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:08:26:50 -0400] "GET
> > /stuff_i_got_in_emails/Oprahs-Bees.gif;O=D HTTP/1.0" 404 239 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:08:50:31 -0400] "GET
> > /stuff_i_got_in_emails/Reindeer_Mural.jpg;O=D HTTP/1.0" 404 242 "-" "HD
> > nutch agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:09:02:52 -0400] "GET
> > /stuff_i_got_in_emails/snowmen4.gif;O=D HTTP/1.0" 404 236 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:09:04:52 -0400] "GET
> > /stuff_i_got_in_emails/ATT00173.jpg;O=D HTTP/1.0" 404 236 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:09:22:19 -0400] "GET
> > /stuff_i_got_in_emails/?C=S;O=A HTTP/1.0" 200 159957 "-" "HD nutch
> > agent/Nutch-1.1 (Think)"
> > 119.139.27.64 - - [25/Oct/2012:10:55:09 -0400] "GET
> > /stuff_i_got_in_emails/outofthecloset%20(5).jpg;O=D HTTP/1.0" 404 246 "-"
> > "HD nutch agent/Nutch-1.1 (Think)"
> >
> >
> >
> 
> 
> 
> -- 
> Lewis
> 
