Gary Reysa wrote:
Hi,
I don't really know if this is a Wget bug, or some problem with my
website, but, either way, maybe you can help.
I have a web site ( www.BuildItSolar.com ) with perhaps a few hundred
pages (260MB of storage total). Someone did a Wget on my site, and
managed to log 111,000 hits and 58,000 page views (using more than a GB
of bandwidth).
I am wondering how this can happen, since the number of page views is
about 200 times the number of pages on my site??
Is there something I can do to prevent this? Is there something about
the organization of my website that is causing Wget to get stuck in a loop?
I've never used Wget, but I am guessing that this guy really did not
want 50,000+ pages -- do you provide some way for the user to have Wget
shut itself down when it reaches some reasonable limit?
My website is non-commercial, and provides a lot of information that
people find useful in building renewable energy projects. It generates
zero income, and I can't really afford to have a lot of people come in
and burn up GBs of bandwidth to no useful end. Help!
Gary Reysa
Bozeman, MT
[EMAIL PROTECTED]
Hello Gary,
From a quick look at your site, it appears to be mainly static HTML,
which should not generate a lot of extra crawls. If some portion of your
site is dynamic, like a calendar, that could send wget into an infinite
loop. It would be much easier to tell if you could look at the server
logs showing which pages were requested; they would quickly tell you
what wget was getting hung up on.
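For example, if your server keeps standard Apache-style access logs and
the downloader's user-agent string contains "Wget", something along
these lines would show which URLs were requested most often (the log
file name is just a placeholder; use whatever your host provides):

    grep -i wget access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

Any URL that shows up hundreds of times is a good candidate for where
the loop is.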
One problem I did notice is that your site is generating "soft 404s".
In other words, it is sending back an HTTP 200 response when it should
be sending back a 404. So if wget tries to access
http://www.builditsolar.com/blah
your web server tells wget that the page actually exists. This *could*
cause more crawls than necessary, but it is probably not the cause here.
It should be fixed regardless.
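You can check this yourself with wget's spider mode, which prints the
server's response headers without saving anything (the URL is just the
made-up example above):

    wget --spider --server-response http://www.builditsolar.com/blah

A correctly configured server would answer with "404 Not Found" there
instead of "200 OK".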
It's possible the wget user did not know what they were doing and ran
the crawler several times. You could try to block traffic from that
particular IP address, or create a robots.txt file that tells crawlers
to stay away from your whole site or just certain pages. Wget respects
robots.txt. For more info:
http://www.robotstxt.org/wc/robots.html
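As a sketch, a minimal robots.txt that keeps well-behaved crawlers out
of one directory (the directory name is just a placeholder) would look
like:

    User-agent: *
    Disallow: /calendar/

or, to tell wget specifically to stay away from the whole site:

    User-agent: wget
    Disallow: /

If your host runs Apache and allows .htaccess files, blocking the one
IP address would look roughly like this (the address is a placeholder;
use the one from your logs):

    Order allow,deny
    Deny from 203.0.113.45
    Allow from all

As for your question about limits: wget does have options such as -Q
(download quota) and -l (recursion depth) that would have kept the crawl
to a reasonable size, but those are in the hands of the person running
it, not the site owner.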
Regards,
Frank