Gary Reysa wrote:
Hi,

I don't really know if this is a Wget bug or some problem with my website, but either way, maybe you can help.

I have a web site (www.BuildItSolar.com) with perhaps a few hundred pages (260MB of storage total). Someone ran Wget on my site and managed to log 111,000 hits and 58,000 page views, using more than a GB of bandwidth.

How can this happen, when the number of page views is roughly 200 times the number of pages on my site?

Is there something I can do to prevent this? Is there something about the organization of my website that is causing Wget to get stuck in a loop?

I've never used Wget, but I am guessing that this guy really did not want 50,000+ pages -- do you provide some way for Wget to shut itself down when it reaches some reasonable limit?

My website is non-commercial, and provides a lot of information that people find useful in building renewable energy projects. It generates zero income, and I can't really afford to have a lot of people come in and burn up GBs of bandwidth to no useful end. Help!

Gary Reysa
Bozeman, MT
[EMAIL PROTECTED]


Hello Gary,

From a quick look at your site, it appears to be mainly static HTML, which would not generate a lot of extra crawls. If you have some dynamic portion of your site, like a calendar, that could send wget into an infinite loop. It would be much easier to tell if you could look at the server logs that show which pages were requested; they would quickly tell you what wget was getting hung up on.
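
If your host runs Apache (or anything that writes a standard access log), a one-liner along these lines lists the most-requested URLs; the log path below is just a guess and will differ on your server. A loop usually shows up as a handful of paths with absurd hit counts:

    # count requests per URL path ($7 in Apache's common/combined log format)
    awk '{print $7}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20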

One problem I did notice is that your site is generating "soft 404s". In other words, it is sending back an HTTP 200 response when it should be sending back a 404. So if wget tries to access

http://www.builditsolar.com/blah

your web server tells wget that the page actually exists. This *could* cause more crawls than necessary, but it's not the likely cause here. It is still worth fixing, though.
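
You can check this yourself from the command line with wget's -S/--server-response and --spider options, which print the response headers without downloading anything. A request for a made-up page like the one above should come back with a 404 status line, not 200:

    # request a page that should not exist and inspect the status line
    wget -S --spider http://www.builditsolar.com/blah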

It's possible the wget user did not know what they were doing and ran the crawler several times. You could block traffic from that particular IP address, or create a robots.txt file that tells crawlers to stay away from your site or just from certain pages; wget respects robots.txt (a sample is shown after the link below). For more info:

http://www.robotstxt.org/wc/robots.html
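
As a rough sketch, a robots.txt placed at the top level of your site might look like the following; the /calendar/ path is only a placeholder for whatever dynamic area you want crawlers to avoid:

    # keep well-behaved crawlers out of the dynamic pages only
    User-agent: *
    Disallow: /calendar/

    # or, to keep them away from the whole site:
    # User-agent: *
    # Disallow: /

Note that this only stops crawlers that choose to honor it (wget does by default); it is not a substitute for blocking an abusive IP address.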

Regards,
Frank
