You could write a Rack middleware that tracks IP addresses and request times. If the same IP makes more than X requests in a given period, send it to another page. This approach has several problems, though:
1. Legitimate spiders (Googlebot, Yahoo Slurp, etc.) will have to be let through. They usually work from many different servers in many different data centers, so filtering by IP address isn't an option; instead you will need to look at the HTTP request headers for the bot's signature.
2. A large office might have a single gateway to the internet. If so, you might end up blocking real users.
3. If the script scraping your site is distributed, its requests will come from multiple IP addresses.

Depending on the type of spider being used against your site, the HTTP headers might give some clue that the requests aren't coming from a human. For instance, you might be able to look at the user-agent string, timezones, etc. to work out whether it all adds up.

Another suggestion would be to offer an API so that others can use the information on your site. Or obfuscate the HTML somehow to prevent easy structured access to your site's content. I've seen this done with deliberately convoluted HTML that is decoded by JavaScript on the client side.

But as everyone else has already stated, the problem is difficult because of the open nature of HTML and the stateless nature of HTTP.
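
For what it's worth, the middleware idea, together with the user-agent check from point 1, could be sketched roughly like this. Everything here is an assumption for illustration: the class name RequestThrottle, its options (limit, window, redirect_to, clock) and the bot pattern are made up, and the counters live in memory in a single process. User agents are trivially spoofed, so treat the bot check as a courtesy, not as security.

```ruby
# A rough sketch, not a production implementation.
# RequestThrottle and all of its options are hypothetical names.
class RequestThrottle
  # Crude signature match for well-known crawlers (assumed pattern;
  # a scraper can fake this header).
  KNOWN_BOTS = /Googlebot|Slurp|bingbot/i

  def initialize(app, limit: 100, window: 60, redirect_to: "/slow-down",
                 clock: -> { Time.now.to_f })
    @app = app
    @limit = limit              # max requests allowed per window
    @window = window            # window length in seconds
    @redirect_to = redirect_to  # where throttled clients are sent
    @clock = clock              # injectable so the window can be tested
    @hits = Hash.new { |hash, ip| hash[ip] = [] }
    @lock = Mutex.new
  end

  def call(env)
    # Let recognised crawlers straight through.
    return @app.call(env) if env["HTTP_USER_AGENT"].to_s.match?(KNOWN_BOTS)
    # Never throttle the page we redirect to, or we would loop forever.
    return @app.call(env) if env["PATH_INFO"] == @redirect_to

    if over_limit?(env["REMOTE_ADDR"].to_s)
      [302, { "location" => @redirect_to, "content-type" => "text/plain" },
       ["Too many requests\n"]]
    else
      @app.call(env)
    end
  end

  private

  # Records this request and reports whether the IP has exceeded the limit
  # within the sliding window.
  def over_limit?(ip)
    now = @clock.call
    @lock.synchronize do
      times = @hits[ip]
      times.reject! { |t| t <= now - @window } # drop requests outside the window
      times << now
      times.size > @limit
    end
  end
end
```

In a config.ru you would then add something like `use RequestThrottle, limit: 100, window: 60` before `run`. Note that this sketch does nothing about problems 2 and 3 above: everyone behind a shared gateway counts as one IP, and a distributed scraper stays under the limit on each address. The hash also grows with every new IP it sees, and counts are not shared between server processes, so a real deployment would want a shared store with expiry.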
