Re: Controlling Spiders
Thanks to everyone for their thoughts on this. Very good advice and ideas here!

R
Re: Controlling Spiders
This is the official way to do it: http://www.robotstxt.org/

The problem with robots.txt is that it is only obeyed by well-behaved bots. Well-behaved bots won't cause traffic problems, even if a dozen of them are sucking down your site at the same time. I've seen bad bots read robots.txt and immediately request all the exclusions.

Personally, I use robots.txt to forbid a directory containing a trap for bad bots, and shut the door on them.
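A minimal sketch of that trap idea; the /trap/ directory and log path here are placeholders, not the poster's actual setup:

    # robots.txt -- well-behaved bots will never fetch anything under /trap/
    User-agent: *
    Disallow: /trap/

    <!--- /trap/index.cfm: anything requesting a URL that robots.txt forbids
          is assumed hostile, so record its address for the blocklist
          (in practice, write to a path outside the webroot) --->
    <cffile action="append"
            file="#expandPath('/logs/badbots.txt')#"
            output="#cgi.remote_addr# #now()#">

A scheduled task or firewall rule can then pick up that list and block the offenders.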
Re: Controlling Spiders
What would be nice is if the companies that create the likes of IIS and Apache got together and defined a protocol at the server level; then bots could request what they would like, and if the server refuses them, then stiff shit, jack.

I like rewrite rules for this purpose, and it may be best at the firewall level, but any bot that disobeys the robots.txt file gets placed into the rewrite rules, and I forward it to a server 404 error.

--
Regards,
Andrew Scott
WebSite: http://www.andyscott.id.au/
Google+: http://plus.google.com/108193156965451149543

On Wed, May 9, 2012 at 12:20 AM, wrote:
> This is the official way to do it: http://www.robotstxt.org/ The problem with robots.txt is that it is only obeyed by well-behaved bots. [...]
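For anyone wanting to try the rewrite-rule approach above in Apache, a sketch along those lines; the user-agent string is a placeholder, not from Andrew's actual rules:

    RewriteEngine On
    # send a known-bad crawler a 404 instead of content ([F] gives a 403 instead)
    RewriteCond %{HTTP_USER_AGENT} badbot [NC]
    RewriteRule ^ - [R=404,L]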
Re: Controlling Spiders
> What would be nice is if the companies that create the likes of IIS and Apache got together and defined a protocol at the server level.

IMHO, the only rules that would be really useful would be to force robots to:

1. identify the web page where they describe their purpose and what they are looking for, and
2. give a list of the IP addresses or ranges they crawl from, so we can whitelist or blacklist them.

But asking them to obey the robots.txt file is like relying on laws to keep burglars away from your house.
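Until such a protocol exists, the blacklist half can be done by hand in Application.cfc. A sketch, with RFC 5737 example addresses standing in for real offenders:

    component {
        this.name = "mysite";

        public boolean function onRequestStart(required string targetPage) {
            // hypothetical, manually maintained bad-bot addresses
            var blocked = ["203.0.113.5", "198.51.100.7"];
            if (arrayFind(blocked, cgi.remote_addr)) {
                getPageContext().getResponse().setStatus(403);
                return false; // abort the request
            }
            return true;
        }
    }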
Re: Controlling Spiders
On Mon, May 7, 2012 at 6:42 PM, Richard Steele r...@photoeye.com wrote:
> Today we had Google + 3 or 4 other spiders hammering our multi-instance server at the same time. [...]

In addition to the advice you've gotten (and please take this as constructive advice), it's possible that if you are having trouble dealing with search engine traffic, you do not, in fact, have a high traffic server. This may be a good gut-check time to ask yourself, "Can we handle increased load if our website suddenly becomes more successful?" Seeing your website hiccup on search crawler traffic may be a good early indicator that you need to do some stress testing and find places to improve.

-Cameron

--
Cameron Childress
p: 678.637.5072
im: cameroncf
facebook: http://www.facebook.com/cameroncf | twitter: http://twitter.com/cameronc | google+: https://profiles.google.com/u/0/117829379451708140985
Re: Controlling Spiders
Here's my cent and a half.

I helped a large ecommerce group 10x their Google traffic by presenting the data to Google as direct HTML rather than Ajax, since Google was not indexing the Ajax-loaded data. That 10x traffic meant big business for the company, and they decided to push CF into a cluster group via Railo. Just because it's high traffic doesn't mean it's high expense.

Treat Google as the best sales guy your company can have and give it what it wants. Know that page load time to the crawler comes into play for the Google reports, reindex times, etc. Check out Google Webmaster Tools and put the site through the tests there.

On Tue, May 8, 2012 at 1:21 PM, Cameron Childress camer...@gmail.com wrote:
> In addition to the advice you've gotten (and please take this as constructive advice), it's possible that if you are having trouble dealing with search engine traffic, you do not, in fact, have a high traffic server. [...]
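A sketch of the direct-HTML idea above, with a hypothetical datasource and table; the point is that the crawler gets the listing as plain HTML on the first response instead of an empty shell filled in by Ajax:

    <!--- render the product list directly into the page --->
    <cfquery name="products" datasource="shop">
        SELECT id, name FROM product
    </cfquery>
    <ul>
        <cfoutput query="products">
            <li><a href="product.cfm?id=#id#">#name#</a></li>
        </cfoutput>
    </ul>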
Controlling Spiders
Today we had Google + 3 or 4 other spiders hammering our multi-instance server at the same time. Is there a way to control these bots to prevent them from submitting request after request? How do most high traffic servers handle this? Thanks!
Re: Controlling Spiders
This is the official way to do it: http://www.robotstxt.org/

Only the main search engines honour it, though; the less popular spiders will ignore it and do as they like. For those you can do some web content filtering, or use a web application firewall to control activity on your site.

On Mon, May 7, 2012 at 11:42 PM, Richard Steele r...@photoeye.com wrote:
> Today we had Google + 3 or 4 other spiders hammering our multi-instance server at the same time. [...]
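For reference, a minimal robots.txt; the directory names are placeholders:

    User-agent: *
    Disallow: /admin/
    Disallow: /search/

It has to live at the root of the site (e.g. http://example.com/robots.txt) for crawlers to find it.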
Re: Controlling Spiders
There is a robots.txt setting that may be of some use:

User-agent: *
Crawl-delay: 0.5

That tells all bots to only hit two pages per second. I'm pretty sure Google does not follow this particular command, and I know from sad experience that there are plenty of rogues out there who will either pay lip service to the setting or ignore it. Google Webmaster Tools has a setting that will allow you to ask nicely to please consider throttling down some, IIRC... but the reality I have found is, if you have a lot of pages that are bot-popular, then to truly solve the problem you have to rethink what you are doing.

A client of mine had a vehicle multiple listing service consisting of tens of thousands of units up for sale, where each unit generated three pages (a quick view, a full view and a picture page), and the units available changed quite a bit in a given day... bots knew this and crawled and re-crawled him mercilessly despite all efforts to get them to tone it down. We kept throwing hardware at the problem after increasing efficiency everywhere we could think of, until the next step was a big one: multiple CF Enterprise licenses and a cluster.

We found another solution: generation of static .html on the back end as pages change, instead of gratuitous use of .cfm's to effectively no purpose, since the material only changed when the editor changed it (very infrequent compared to the number of page views) or when the feed from the third party came in overnight.

This approach increases the server's capacity to handle concurrent traffic *immensely*, but it also poses multiple challenges. Maintaining session state is not the least of these; also, when dealing with daily mammoth CSV and XML feeds from third parties, we had tens of thousands of pages to generate or update (solution: use a second server on a cheap VPS dedicated to feed processing and page creation). It's definitely not for everyone. We got away with it, and for that particular application it was a solution that allowed better overall performance and low operating cost. A rare win/win.

--
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com
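A sketch of the static-generation step, with hypothetical paths and variable names; it would run whenever the editor saves a unit or the overnight feed touches one:

    <!--- render the unit's page once, then write it out as plain .html
          so the web server can serve it with no CF request at all --->
    <cfsavecontent variable="pageHtml">
        <cfinclude template="renderUnit.cfm">
    </cfsavecontent>
    <cffile action="write"
            file="#expandPath('/units/unit_#unitId#.html')#"
            output="#pageHtml#">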
Re: Controlling Spiders
Or if you do need session state and/or customized content per user, then a more sophisticated caching implementation is in order. Cache (on disk or in memory) the parts of the page that are expensive to produce and are common across all users, and then include them into dynamic pages which require less processing but have the user-specific bits.

We're doing this with some of our sites: cache the center of the page, which is constant, and then include it into a wrapper which contains all the per-user logic. Even some judicious use of cfcache will get you a long way.

On 5/7/12 4:21 PM, Money Pit wrote:
> There is a robots.txt setting that may be of some use. [...] We found another solution: generation of static .html on the back end as pages change [...]
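A sketch of that wrapper pattern, assuming CF9-style fragment caching via cfcache; the template names are placeholders:

    <!--- per-user logic stays live on every request --->
    <cfinclude template="header.cfm">

    <!--- the expensive, identical-for-everyone center is cached for 30 minutes --->
    <cfcache action="content" timespan="#createTimeSpan(0,0,30,0)#">
        <cfinclude template="expensiveCenter.cfm">
    </cfcache>

    <cfinclude template="footer.cfm">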
Re: Controlling Spiders
On Mon, May 7, 2012 at 4:28 PM, .jonah jonah@creori.com wrote:
> Even some judicious use of cfcache will get you a long way.

Yup. For us, the expensive stuff was unique per page; but also, part of the problem that we never seemed to be able to get a handle on was the concurrency demands associated with having bots hit as many pages as they could, with as many threads as you set the CF server to allow.

For that matter, if there are common queries you can get an enormous amount of mileage out of short query caches of two or three seconds in duration. For example, on one of those listings, say the dealer info is a query. The bot could hit the dealer's listings as a block, so if you cache the dealer's data for, let's say, ten seconds, you can eliminate a ton of hits to the db... and since the cache is short-lived, there's room for all of the other hundreds of dealers whose material is being accessed - and hopefully cached to good effect - at the same time.

Turning CF into a backend-only processor is no small task. It's not something you can do without a whole lot of planning and effort, which is a good thing, because unless you really need to, you shouldn't.

Something else to consider is rel=nofollow, in the hope that you can exert some control over redundant traffic flow.
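The short query cache is one attribute on the query tag. A sketch, with a hypothetical datasource and table; val() keeps the id numeric since this sketch skips cfqueryparam:

    <!--- cache this dealer's info for ten seconds --->
    <cfquery name="dealerInfo" datasource="listings"
             cachedwithin="#createTimeSpan(0,0,0,10)#">
        SELECT dealer_id, name, phone
        FROM   dealer
        WHERE  dealer_id = #val(url.dealerID)#
    </cfquery>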
Re: Controlling Spiders
> Today we had Google + 3 or 4 other spiders hammering our multi-instance server at the same time. Is there a way to control these bots to prevent them from submitting request after request? How do most high traffic servers handle this? Thanks!

The other answers you've already received are on-point, so I won't reiterate those. But in addition, it can be important to ensure that you don't create a brand new session for each page request, as many crawlers disregard cookies. Also, high-traffic servers typically handle this sort of thing via caching, which can be done many different ways, with different levels of aggressiveness. For example, generating static HTML as Matt mentioned is a pretty aggressive (and potentially very effective) caching mechanism. High-volume sites often use third-party CDNs to take care of some of this as well.

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on GSA Schedule, and provides the highest caliber vendor-authorized instruction at our training centers, online, or onsite.
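One common way to handle the cookie-less crawler problem is to shorten the session for anything that looks like a bot, so each request doesn't pin twenty minutes of server memory. A sketch in Application.cfc; the user-agent pattern is illustrative, not exhaustive:

    component {
        this.name = "mysite";
        this.sessionManagement = true;

        // give likely crawlers a 2-second session, everyone else 20 minutes
        if (refindNoCase("(bot|crawl|spider|slurp)", cgi.http_user_agent)) {
            this.sessionTimeout = createTimeSpan(0, 0, 0, 2);
        } else {
            this.sessionTimeout = createTimeSpan(0, 0, 20, 0);
        }
    }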