Re: Controlling Spiders

2012-05-12 Thread Richard Steele

Thanks to everyone for their thoughts on this. Very good advice and ideas here!
R 


Re: Controlling Spiders

2012-05-08 Thread Claude Schnéegans

 this is the official way to do it.
http://www.robotstxt.org/

The problem with robots.txt is that it is only obeyed by well-behaved bots.
Well-behaved bots won't cause traffic problems, even if a dozen of them are
crawling your site at the same time.
I've seen bad bots read robots.txt and then immediately request all of the
excluded URLs.
Personally, I use robots.txt to disallow a directory containing a trap for
bad bots, and shut the door on them.
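
For example, a trap along those lines might look something like this (the
directory name, log path, and page are just placeholders, not a description
of any actual setup): robots.txt disallows a directory nothing legitimate
ever links to, and the page inside it logs whoever requests it anyway.

# robots.txt: no compliant bot should ever enter this directory
User-agent: *
Disallow: /bot-trap/

<!--- /bot-trap/index.cfm: anything requesting this URL ignored robots.txt --->
<cflock name="botTrapLog" type="exclusive" timeout="5">
  <cffile action="append"
          file="#expandPath('/logs/bad-bots.txt')#"
          output="#cgi.remote_addr##chr(9)##cgi.http_user_agent##chr(9)##now()#"
          addnewline="true">
</cflock>
<!--- The logged IPs can then be blocked in Application.cfc or at the firewall --->
<cfabort>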


Re: Controlling Spiders

2012-05-08 Thread Andrew Scott

It would be nice if the companies that create the likes of IIS and Apache
and others got together and defined a protocol at the server level; bots
could then request what they would like, and if the server refuses them,
tough luck.

I like rewrite rules for this purpose, and it may be best done at the
firewall level, but any bot that disobeys the robots.txt file gets placed
into the rewrite rules and forwarded to a 404 error.
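
For example, a mod_rewrite rule along those lines might look something like
this (the user-agent patterns are placeholders for whatever bots you've
caught misbehaving; this sketch returns a 403 via the F flag, and newer
Apache versions also accept R=404 if you want an actual 404):

# .htaccess / httpd.conf sketch: refuse known-bad bots outright
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilCrawler) [NC]
RewriteRule .* - [F,L]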

-- 
Regards,
Andrew Scott
WebSite: http://www.andyscott.id.au/
Google+: http://plus.google.com/108193156965451149543





Re: Controlling Spiders

2012-05-08 Thread Claude Schnéegans

 What would be nice if the companies that create the likes of IIS and Apache
and others, got together and defined a protocol at the server level.

IMHO, the only rules that would be really useful would be to force robots to:
1. identify the web page where they describe their purpose and what they are
looking for;
2. give a list of the IP addresses or ranges they are crawling from, so we can
whitelist or blacklist them.

But asking them to obey a robots.txt file is like relying on laws to keep
burglars away from your house.


Re: Controlling Spiders

2012-05-08 Thread Cameron Childress

On Mon, May 7, 2012 at 6:42 PM, Richard Steele r...@photoeye.com wrote:

 Today we had Google + 3 or 4 other spiders hammering our multi-instance
 server at the same time. Is there a way to control these bots to prevent
 them from submitting request after request? How do most high traffic
 servers handle this? Thanks!


In addition to the advice you've gotten (and please take this as
constructive advice), it's possible that if you are having trouble dealing
with search engine traffic, you do not, in fact, have a high-traffic
server. This may be a good gut-check moment to ask yourself, "Can we
handle increased load if our website suddenly becomes more successful?"
Seeing your website hiccup on search crawler traffic may be a good early
indicator that you need to do some stress testing and find places to
improve.

-Cameron

-- 
Cameron Childress
--
p:   678.637.5072
im: cameroncf
facebook http://www.facebook.com/cameroncf |
twitterhttp://twitter.com/cameronc |
google+ https://profiles.google.com/u/0/117829379451708140985



Re: Controlling Spiders

2012-05-08 Thread Brian Thornton

Here's my cent and a half.

I helped a large ecommerce group 10x their Google traffic by presenting the
data to Google as direct HTML rather than AJAX, since Google was not indexing
the AJAX-loaded data.
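
In other words, render the content into the HTML the crawler receives rather
than loading it afterwards with JavaScript. A rough sketch of the difference
(query, datasource, and markup are invented for illustration):

<!--- Before: the crawler only sees an empty placeholder --->
<!--- <div id="products"></div> <script>loadProductsViaAjax();</script> --->

<!--- After: the product list is in the initial HTML response --->
<cfquery name="qProducts" datasource="myDSN">
    SELECT name, price FROM products
</cfquery>
<cfoutput query="qProducts">
    <div class="product">#name# - #dollarFormat(price)#</div>
</cfoutput>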

That 10x traffic meant big business for the company, and they decided
to push CF into a cluster via Railo.

Just because it's high traffic doesn't mean it's high expense.

Treat Google as the best sales guy your company can have, and give it
what it wants.

Know that page load time for the crawler factors into the Google
reports, re-index times, etc. Check out Google Webmaster Tools and
put your site through the tests there.



Controlling Spiders

2012-05-07 Thread Richard Steele

Today we had Google + 3 or 4 other spiders hammering our multi-instance server 
at the same time. Is there a way to control these bots to prevent them from 
submitting request after request? How do most high traffic servers handle this? 
Thanks! 


Re: Controlling Spiders

2012-05-07 Thread Russ Michaels

this is the official way to do it.
http://www.robotstxt.org/

Only the main search engines honour this though; the less popular
spiders will ignore it and do as they like. For those you can do some web
content filtering or use a web application firewall to control activity on
your site.
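
For reference, a basic robots.txt along those lines might look like this
(the paths and the bot name are placeholders):

# Keep compliant bots out of expensive or uninteresting areas
User-agent: *
Disallow: /search/
Disallow: /admin/

# Shut one particular bot out entirely (only works if it actually obeys robots.txt)
User-agent: SomeBadBot
Disallow: /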

On Mon, May 7, 2012 at 11:42 PM, Richard Steele r...@photoeye.com wrote:


 Today we had Google + 3 or 4 other spiders hammering our multi-instance
 server at the same time. Is there a way to control these bots to prevent
 them from submitting request after request? How do most high traffic
 servers handle this? Thanks!

 


Re: Controlling Spiders

2012-05-07 Thread Money Pit

There is a robots.txt setting that may be of some use.

User-agent: *
Crawl-delay: 0.5

Tells all bots to only hit two pages per second.

I'm pretty sure Google does not follow this particular directive, and I
know from sad experience that there are plenty of rogues out there who
will either pay lip service to the setting or ignore it outright. Google
Webmaster Tools has a setting that lets you ask Google nicely to please
consider throttling down some, IIRC ... but the reality I have found is
that if you have a lot of pages that are bot-popular, then to truly solve
the problem you have to rethink what you are doing.

A client of mine had a vehicle multiple listing service consisting of
tens of thousands of units up for sale, where each unit generated
three pages (a quick view, a full view and a picture page) and the
units available changed quite a bit in a given day ... bots knew this
and crawled and re-crawled him mercilessly despite all efforts to get
them to tone it down.  We kept throwing hardware at the problem after
increasing efficiency everywhere we could think of, until the next
step was a big one: Multiple CF Enterprise licenses and a cluster.

We found another solution: generating static .html on the back end as
pages change, instead of gratuitous use of .cfm pages to effectively no
purpose, since the material only changed when the editor changed it
(very infrequently compared to the number of page views) or when the
feed from the third party came in overnight.

This approach increases the server's capacity to handle concurrent
traffic *immensely*, but it also poses multiple challenges. Maintaining
session state is not the least of these; and when dealing with daily
mammoth CSV and XML feeds from third parties, we had tens of thousands
of pages to generate or update (solution: a second server on a cheap
VPS dedicated to feed processing and page creation).

It's definitely not for everyone. We got away with it, and for that
particular application it was a solution that allowed better overall
performance and low operating cost. A rare win/win.
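
A bare-bones sketch of that sort of static generation, run whenever a
listing changes or the overnight feed is processed (the paths, template,
and listingId variable are all hypothetical):

<!--- Render the listing detail once, then write it out as a plain .html file --->
<cfsavecontent variable="pageHtml">
    <cfinclude template="/templates/listingDetail.cfm">
</cfsavecontent>
<cffile action="write"
        file="#expandPath('/static/listings/#listingId#.html')#"
        output="#pageHtml#">
<!--- The web server then serves /static/listings/12345.html without touching CF at all --->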

-- 
--m@Robertson--
Janitor, The Robertson Team
mysecretbase.com


Re: Controlling Spiders

2012-05-07 Thread .jonah

Or if you do need session state and/or customized content per user, then 
a more sophisticated caching implementation is in order. Cache (on disk 
or in memory) the parts of the page that are expensive to produce and 
are common across all users and then include them into dynamic pages 
which require less processing but have the user-specific bits.

We're doing this with some of our sites. Cache the center of the page 
which is constant and then include it into a wrapper which contains all 
the per-user logic.

Even some judicious use of cfcache will get you a long way.
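
As a sketch of that fragment approach using CF9's cacheGet()/cachePut()
(the cache key, timeout, template, and session.userName are all invented
for the example):

<!--- Rebuild the expensive, shared center fragment only when it isn't cached --->
<cfset centerHtml = cacheGet("homepageCenter")>
<cfif isNull(centerHtml)>
    <cfsavecontent variable="centerHtml">
        <cfinclude template="/fragments/homepageCenter.cfm">
    </cfsavecontent>
    <cfset cachePut("homepageCenter", centerHtml, createTimespan(0, 1, 0, 0))>
</cfif>

<!--- The per-user wrapper stays dynamic and cheap --->
<cfoutput>
    <div class="welcome">Welcome back, #htmlEditFormat(session.userName)#</div>
    #centerHtml#
</cfoutput>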



Re: Controlling Spiders

2012-05-07 Thread Money Pit

On Mon, May 7, 2012 at 4:28 PM, .jonah jonah@creori.com wrote:

 Even some judicious use of cfcache will get you a long way.

Yup. For us, the expensive stuff was unique per page, but part of the
problem we never seemed to be able to get a handle on was the concurrency
demand of bots hitting as many pages as they could, with as many threads
as you set the CF server to allow.

For that matter, if there are common queries, you can get an enormous
amount of mileage out of short query caches of two or three seconds in
duration. For example, on one of those listings, say the dealer info
is a query. The bot could hit the dealer's listings as a block, so if
you cache the dealer's data for, let's say, ten seconds, you can
eliminate a ton of hits to the db... and since the cache is short-lived,
there's room for all of the other hundreds of dealers whose material is
being accessed - and hopefully cached to good effect - at the same time.
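
A sketch of that kind of short-lived query cache (datasource, table, and
columns are invented; note that some CF versions won't cache queries that
use cfqueryparam, hence the val() guard here):

<!--- Repeat hits for this dealer within ten seconds skip the database entirely --->
<cfquery name="qDealer" datasource="listingsDSN"
         cachedwithin="#createTimespan(0, 0, 0, 10)#">
    SELECT dealer_id, dealer_name, city, phone
    FROM dealers
    WHERE dealer_id = #val(url.dealerId)#
</cfquery>
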
Turning CF into a back-end processor only is no small task. It's not
something you can do without a whole lot of planning and effort, which
is a good thing, because unless you really need to, you shouldn't.

Something else to consider is rel=nofollow, in the hope that you can
exert some control over redundant crawl traffic.


Re: Controlling Spiders

2012-05-07 Thread Dave Watts

 Today we had Google + 3 or 4 other spiders hammering our multi-instance 
 server at the same time. Is there a way to
 control these bots to prevent them from submitting request after request? How 
 do most high traffic servers handle this?
 Thanks!

The other answers you've already received are on-point, so I won't
reiterate those. But in addition, it can be important to ensure that
you don't create a brand new session for each page request, as many
crawlers disregard cookies.
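
One common way to handle that is in Application.cfc: give clients that look
like crawlers (and therefore send no cookies) a very short session, so the
heap doesn't fill up with one-request sessions. A rough sketch, not
necessarily how anyone here does it (the user-agent pattern is a placeholder):

<!--- Application.cfc (sketch) --->
<cfcomponent output="false">
    <cfset this.name = "mySite">
    <cfset this.sessionManagement = true>
    <cfset this.sessionTimeout = createTimespan(0, 0, 30, 0)>
    <!--- Crawlers usually ignore cookies, so each hit would otherwise spawn a fresh 30-minute session --->
    <cfif reFindNoCase("(googlebot|bingbot|slurp|spider|crawl)", cgi.http_user_agent)>
        <cfset this.sessionTimeout = createTimespan(0, 0, 0, 30)>
    </cfif>
</cfcomponent>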

Also, high-traffic servers typically handle this sort of thing via
caching, which can be done many different ways, with different levels
of aggressiveness. For example, generating static HTML as Matt
mentioned is a pretty aggressive (and potentially very effective)
caching mechanism. High volume sites often use third-party CDNs to
take care of some of this as well.

Dave Watts, CTO, Fig Leaf Software
http://www.figleaf.com/
http://training.figleaf.com/

Fig Leaf Software is a Veteran-Owned Small Business (VOSB) on
GSA Schedule, and provides the highest caliber vendor-authorized
instruction at our training centers, online, or onsite.
