Gary Reysa wrote:
Hi,
I don't really know if this is a Wget bug, or some problem with my
website, but, either way, maybe you can help.
I have a web site ( www.BuildItSolar.com ) with perhaps a few hundred
pages (260MB of storage total). Someone did a Wget on my site, and
managed to log 111,000 hits and 58,000 page views (using more than a GB
of bandwidth).
I am wondering how this can happen, since the number of page views is
about 200 times the number of pages on my site??
Is there something I can do to prevent this? Is there something about
the organization of my website that is causing Wget to get stuck in a loop?
I've never used Wget, but I am guessing that this guy really did not
want 50,000+ pages -- do you provide some way for the user to have Wget
shut itself down when it reaches some reasonable limit?
My website is non-commercial, and provides a lot of information that
people find useful in building renewable energy projects. It generates
zero income, and I can't really afford to have a lot of people come in
and burn up GBs of bandwidth to no useful end. Help!
Gary Reysa
Bozeman, MT
[EMAIL PROTECTED]
Hello Gary,
From a quick look at your site, it appears to be mainly static HTML,
which should not generate a lot of extra crawls. If some portion of your
site is dynamic, like a calendar, that could send wget into an infinite
loop. It would be much easier to tell if you could look at the server
logs showing which pages were requested; they would quickly tell you
what wget was getting hung up on.
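For example, if your server keeps standard Apache-style access logs and
the downloader's user-agent string contains "Wget", something along
these lines would show which URLs were requested most often (the log
file name is just a placeholder; use whatever your host provides):

    grep -i wget access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

Any URL that shows up hundreds of times is a good candidate for where
the loop is.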
One problem I did notice is that your site is generating "soft 404s".
In other words, it is sending back an HTTP 200 response when it should
be sending back a 404. So if wget tries to access
http://www.builditsolar.com/blah
your web server tells wget that the page actually exists. This *could*
cause more crawls than necessary, but it is probably not the cause here.
It should be fixed regardless.
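You can check this yourself with wget's spider mode, which prints the
server's response headers without saving anything (the URL is just the
made-up example above):

    wget --spider --server-response http://www.builditsolar.com/blah

A correctly configured server would answer with "404 Not Found" there
instead of "200 OK".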
It's possible the wget user did not know what they were doing and ran
the crawler several times. You could try to block traffic from that
particular IP address, or create a robots.txt file that tells crawlers
to stay away from your whole site or just certain pages. Wget respects
robots.txt. For more info:
http://www.robotstxt.org/wc/robots.html
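As a sketch, a minimal robots.txt that keeps well-behaved crawlers out
of one directory (the directory name is just a placeholder) would look
like:

    User-agent: *
    Disallow: /calendar/

or, to tell wget specifically to stay away from the whole site:

    User-agent: wget
    Disallow: /

If your host runs Apache and allows .htaccess files, blocking the one
IP address would look roughly like this (the address is a placeholder;
use the one from your logs):

    Order allow,deny
    Deny from 203.0.113.45
    Allow from all

As for your question about limits: wget does have options such as -Q
(download quota) and -l (recursion depth) that would have kept the crawl
to a reasonable size, but those are in the hands of the person running
it, not the site owner.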
Regards,
Frank