or anti-scraping?

Mark Gandolfo Fri, 16 Apr 2010 17:48:58 -0700

You may be able to write a Rack middleware that tracks IPs and request
times. If the same IP makes more than X requests within a given period,
send it to another page. This approach has several problems, though:


1. Spiders that are actually doing their job (googlebot, yahoo slurp,
etc.) will have to be let through. These usually work from many
different servers in many different data centers, so IP address
filtering isn't an option; instead you will need to look in the
HTTP request headers for the bot's signature.

2. A large office might have a single gateway to access the internet.
If this is the case, you might be blocking real user requests.

3. If the script used to scrape your site works in a distributed
fashion, its requests will come from multiple IP addresses.
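
To make the idea concrete, here is a rough sketch of that kind of
middleware. It is only an illustration: the class name RequestThrottle,
the limit/window/redirect_to options, the "/slow-down" path and the
injectable clock are all made up for this example, and the bot check
is a plain user-agent match (which is trivially spoofed). It does
nothing about problems 2 and 3 above.

```ruby
# Hypothetical Rack middleware: counts requests per IP inside a sliding
# time window and redirects clients that go over the limit.
class RequestThrottle
  # Assumed crawler signatures, taken from the bots named above.
  # User-agent strings can be faked, so this is only a first-pass filter.
  BOT_SIGNATURES = /googlebot|slurp/i

  def initialize(app, limit: 100, window: 60, redirect_to: "/slow-down",
                 clock: -> { Time.now.to_f })
    @app = app
    @limit = limit              # max requests allowed per window
    @window = window            # window length in seconds
    @redirect_to = redirect_to  # where over-limit clients are sent
    @clock = clock              # injectable so the window can be tested
    @hits = Hash.new { |hash, ip| hash[ip] = [] }
    @lock = Mutex.new
  end

  def call(env)
    # Let recognised crawlers through untouched.
    return @app.call(env) if env["HTTP_USER_AGENT"].to_s.match?(BOT_SIGNATURES)
    # Never throttle the redirect target itself, or we would loop.
    return @app.call(env) if env["PATH_INFO"] == @redirect_to

    if over_limit?(env["REMOTE_ADDR"])
      [302, { "location" => @redirect_to, "content-type" => "text/plain" },
       ["Too many requests"]]
    else
      @app.call(env)
    end
  end

  private

  # Records this request and reports whether the IP is over the limit.
  # Rejected requests are recorded too, so a client that keeps hammering
  # stays blocked until it slows down.
  def over_limit?(ip)
    now = @clock.call
    @lock.synchronize do
      times = @hits[ip]
      times.reject! { |t| t <= now - @window }
      times << now
      times.size > @limit
    end
  end
end
```

In a Rails app this would be added with something like
config.middleware.use RequestThrottle, limit: 100, window: 60. Note
that the counts live in the memory of one process and the hash is never
pruned of idle IPs, so a real deployment would want a shared store
with expiry instead.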

Depending on the type of spider being used against your site, the
HTTP headers might give some clue that the requests aren't coming from
a human. For instance, you might be able to look at the user-agent
string, timezones, etc. to work out if it all adds up.
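
A minimal sketch of that kind of header check might look like the
following. The method name suspicious_header_reasons and the specific
rules are my own guesses at useful heuristics, not a tested ruleset:
it assumes that mainstream browsers send Accept and Accept-Language
headers, and that common scripting tools announce themselves in the
user-agent unless told otherwise. It only looks at headers, so it says
nothing about timezones.

```ruby
# Hypothetical helper: returns the reasons a Rack env looks automated.
# An empty array means nothing looked out of place.
def suspicious_header_reasons(env)
  reasons = []
  agent = env["HTTP_USER_AGENT"].to_s

  reasons << "missing user-agent" if agent.strip.empty?

  # Assumed list of tool names that show up in default user-agents.
  if agent.match?(/curl|wget|python|libwww|mechanize/i)
    reasons << "scripting-library user-agent"
  end

  # Assumption: a real browser sends these headers, so an agent that
  # claims to be "Mozilla" without them does not add up.
  if agent.match?(/mozilla/i)
    if env["HTTP_ACCEPT"].to_s.empty?
      reasons << "browser agent without accept header"
    end
    if env["HTTP_ACCEPT_LANGUAGE"].to_s.empty?
      reasons << "browser agent without accept-language header"
    end
  end

  reasons
end
```

Anything a check like this catches can be fixed by the scraper in a
couple of lines, so it is better used to flag traffic for a closer
look than to block outright.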

Another suggestion would be to offer an API service to let others use
the information you have on your site. Or obfuscate the HTML somehow
to prevent easy structured access to your site's content. I've seen
this done via heavily encoded HTML, with JavaScript to decode it on
the client side.
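
As a rough illustration of the encode-then-decode trick, here is one
way it could be done with Base64 and the browser's atob function. This
is not the scheme used by any site I've seen, just a sketch; the helper
name obfuscate_fragment and its arguments are made up, and it assumes
the element id is a trusted value and the fragment is plain ASCII
(atob does not decode multi-byte UTF-8 correctly on its own).

```ruby
# Hypothetical helper: wraps an HTML fragment so the real markup only
# appears in the page after the browser runs the inline script.
def obfuscate_fragment(html, element_id)
  # pack("m0") is core Ruby's Base64 encoding without line breaks.
  encoded = [html].pack("m0")
  <<~HTML
    <div id="#{element_id}"></div>
    <script>
      document.getElementById("#{element_id}").innerHTML = atob("#{encoded}");
    </script>
  HTML
end

puts obfuscate_fragment("<td>42.00</td>", "price")
```

This only stops scrapers that parse the raw HTML. Anyone who reads the
page source can reverse it, and it also hides the content from search
engines and from users without JavaScript.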

But as everyone else has already stated, the solution is difficult
because of the open state of HTML and the stateless aspect of HTTP.


[rails-oceania] Re: Anyone got a good solution for NginX/Rails anti-ddos and/or anti-scraping?
