On 7/23/2012 11:31 AM, Keith Lofstrom wrote:
> This is not about dirvish, but about the website. Perhaps some
> of you sysadmins can help.
>
> You may occasionally see the dirvish.org website stop responding
> to web requests.
>
> dirvish.org is running on my virtual machine at rimuhosting in
> Dallas, along with half a dozen other low-usage sites. Some of
> the contents on other sites are lectures and videos, about 5GB
> of total content.
>
> Baidu, the Chinese search engine, spiders the net every 15 minutes,
> looking for changes. Which means it attempts to download 20GB
> an hour from my server. Sometimes it does not complete the requests
> in time, and they accumulate. During the last slowdown, netstat
> reported 140 open ports to baiduspider, including many big files.
> Apache stopped taking most new requests, and browsers timed out.

There are a variety of HTTP headers that tell User Agents such as spiders how to cache responses. The important ones include ETag, Last-Modified, Expires, and Cache-Control. When I visit dirvish.org, I see both the ETag and Last-Modified headers, which should allow reasonable caching of that page. When I visit any page under wiki.dirvish.org, though, they are missing, which is typical of dynamically generated pages.

There are actually two time frames in play. The first is how long a page is considered valid. While a page is valid (or fresh), it does not need to be checked for freshness, and a cached copy can be used with no network traffic at all. The second is how long a page can stay in the cache after it may have become invalid. In this case, if there was enough information in the initial response (an ETag or Last-Modified header), the User Agent can issue a conditional GET request asking whether the page has changed. If the cached copy is still valid, the web server simply responds with "304 Not Modified" and no body, which can save a lot of bandwidth for larger files.

The MoinMoin wiki recommends setting up mod_expires to help, but does not seem to support ETag yet. mod_expires should help regardless. Look at the bottom of this page:

http://moinmo.in/AutoUpdatingStuff

If you have other dynamic pages, or even static pages, there should be add-ons or Apache tweaks to improve their freshness and caching behavior. I'm using the Live HTTP Headers add-on for Firefox to see which headers are being sent as I navigate the Dirvish website.

>
> As a temporary measure, I've disallowed baiduspider in robots.txt
> for all my sites. I will move the videos and large files to some
> of the free file hosting services over time. But I want to keep
> serving China's 20% of the world's population with reasonably
> up-to-date search results. So, the question:
>
> Is there any way to tell the search spiders to visit once a day
> or once a week, rather than four times per hour?
> Or send them "recent changes" lists instead of them repeatedly
> downloading the same files? Any other ideas for calming down the
> web crawlers?
>
> Keith
>

-- 
Loren M. Lang
[email protected]
http://www.alzatex.com/

Public Key: ftp://ftp.tallye.com/pub/lorenl_pubkey.asc
Fingerprint: 10A0 7AE2 DAF5 4780 888A 3FA4 DCEE BB39 7654 DE5B

_______________________________________________
Dirvish mailing list
[email protected]
http://www.dirvish.org/mailman/listinfo/dirvish
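
P.S. As a rough illustration of the conditional GET exchange described in the reply above, here is a minimal Python sketch of the decision a server makes between "200 OK" and "304 Not Modified". It is not how Apache or MoinMoin implement it. The function name conditional_status and the plain dict of request headers are made up for this example, and the logic is simplified (for instance, it only handles GET-style validation).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def _opaque(tag):
    """Strip the weak-validator prefix so W/"abc" compares equal to "abc"."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200.

    request_headers: dict of header name -> value (hypothetical shape).
    etag: the resource's current ETag, including quotes, or None.
    last_modified: timezone-aware datetime of the last change, or None.
    """
    # When If-None-Match is present it takes precedence over
    # If-Modified-Since, which is then ignored.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        if etag is not None and ("*" in candidates or _opaque(etag) in candidates):
            return 304
        return 200

    since = request_headers.get("If-Modified-Since")
    if since is not None and last_modified is not None:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            return 200  # unparseable date: treat the request as unconditional
        if since_dt.tzinfo is None:
            since_dt = since_dt.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since_dt:
            return 304
    return 200


if __name__ == "__main__":
    modified = datetime(2012, 7, 23, 18, 31, 0, tzinfo=timezone.utc)
    stamp = format_datetime(modified, usegmt=True)
    # A spider that remembered the Last-Modified value gets a bodyless 304.
    print(conditional_status({"If-Modified-Since": stamp}, None, modified))
    # A first-time visitor sends no validators and gets the full page.
    print(conditional_status({}, '"abc"', modified))
```

The point for the crawler problem is the first case: a spider that kept the validators from its last visit can re-check a large file without downloading it again, provided the server sent an ETag or Last-Modified header in the first place.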
