Re: Scroogle and Tor
On Tuesday 15 February 2011 05:20:21 Mike Perry wrote:

> I was under the impression that we hacked it to also be memory-only, though. But you're right, if I toggle Torbutton to clear my cache, Polipo's is still there...

The Polipo shipped in the Tor bundles has the cache turned off, but any non-Windows users will tend to use the Polipo shipped by their distro - with caching turned on.

*** To unsubscribe, send an e-mail to majord...@torproject.org with unsubscribe or-talk in the body. http://archives.seul.org/or/talk/
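[Editor's note: the bundle-vs-distro difference above comes down to a few lines of polipo.conf. A sketch of the cache-related lines, using option names from the Polipo manual; exact values shipped in the bundles may have differed:]

```
# Route all upstream traffic through the local Tor SOCKS port.
socksParentProxy = "127.0.0.1:9050"
socksProxyType = socks5

# Setting diskCacheRoot to the empty string disables the on-disk
# cache entirely; distro configs typically leave this pointing at
# a directory, which is the behavior discussed above.
diskCacheRoot = ""
```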
Re: Scroogle and Tor
Thus spake Matthew (pump...@cotse.net):

> On 13/02/11 19:09, scroo...@lavabit.com wrote:
>> I've been fighting two different Tor users for a week. Each is apparently having a good time trying to see how quickly they can get results from Scroogle searches via Tor exit nodes. The fastest I've seen is about two per second. Since Tor users are only two percent of all Scroogle searches, I'm not averse to blocking all Tor exits for a while when all else fails. These two Tor users were rotating their search terms, and one also switched his user-agent once. You can see why I might be tempted to throw my block-all-Tor switch on occasion -- sometimes there's no other way to convince the bad guy that he's not going to succeed.
>
> For the less than knowledgeable people amongst us (e.g. me) who want to learn a bit more: what was the rationale for those two Tor users doing what they did? What do they get from it?

I second this.

Daniel, if you can find a way to fingerprint these bots, my suggestion would be to observe the types of queries they are running (perhaps from some of their earlier runs, when you could still ban them by user agent?). One of the things Google does is actually decide your 'Captchaness' based on the content of your queries. Well, at least I suspect that's what they are doing, because I have been able to more reliably reproduce Torbutton Captcha-related bugs when I try hard to write queries like robots that are looking for php sites to exploit.

I would love to hear more about the types of scrapers that abuse Tor. Or rather, I would like to see if someone can at least identify rational behavior behind scrapers that abuse Tor. Some of it could be misdirected malware that is operating from within Torified browsers. Some of it could be deliberately torified malware. Google won't tell us any of this, obviously ;).

-- Mike Perry Mad Computer Scientist fscked.org evil labs
Re: Scroogle and Tor
scroo...@lavabit.com wrote:

> I've been fighting two different Tor users for a week. Each is apparently having a good time trying to see how quickly they can get results from Scroogle searches via Tor exit nodes. [snip]

As the person who (recently) raised the question about the availability of Scroogle via Tor, I want to thank you both for running Scroogle and for coming on this list to explain what happened. I also apologize to the list for not mentioning that Scroogle is once again available via Tor. (I discovered that and meant to publish that fact approx. 24 hours ago.)

You are obviously much more knowledgeable about network issues than I am, so I will leave it to others to advise you about possible mitigations for your problems. It is a real shame about the script kiddies, but such is the world we live in.

Jim
Re: Scroogle and Tor
Some have wondered why anyone would want to abuse Scroogle using Tor. Apart from some malicious types who may be doing it for their own amusement, it looks to me like they are trying to datamine Google -- arguably the largest, most diverse database on the planet. If you can manage to run a script 24/7 that datamines Google, you can monetize your results. Search engine optimizers would like to be able to do this. So would various directory builders. Doing it by scraping google.com directly is not easy. Scroogle provides 100 links of organic results per request, with less than one-half the byte-bloat that Google delivers for the same links and snippets. It is also much easier to parse Scroogle's simple output page than it is to parse Google's.

I spend a couple of hours per day blocking abusers. A huge amount of this is done through a couple dozen monitoring programs I've written, but for the most part these programs provide candidates for blocking only, and my wetware is needed to make the final determination.

My efforts to counter abuse occasionally cause some programmers to consider using Tor to get Scroogle's results. About a year ago I began requiring any and all Tor searches at Scroogle to use SSL. Using SSL is always a good idea, but the main reason I did this is that the SSL requirement discouraged script writers who didn't know how to add it to their scripts. This policy helped immensely in cutting back on the abuse I was seeing from Tor.

Now I'm seeing script writers who have solved the SSL problem. This leaves me with the user-agent, the search terms, and as a last resort, blocking Tor exit nodes. If they vary their search terms and user-agents, it can take hours to analyze patterns and accurately block them by returning a blank page. That's the way I prefer to do it, because I don't like to block Tor exit nodes. Those who are most sympathetic with what Tor is doing are also sympathetic with what Scroogle is doing.
There's a lot of collateral damage associated with blocking Tor exit nodes, and I don't want to alienate the Tor community except as a last resort.

One reason why Scroogle has lasted for more than six years is that we are nonprofit, and Google knows by now that I don't tolerate abuse. My job is to stop the abuser before Scroogle passes their search terms to Google. Abusers who use Tor make this more difficult for me. Blocking an IP address is easy, but blocking Tor abusers without alienating other Tor users is more complex.

-- Daniel Brandt
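[Editor's note: the "monitoring programs provide candidates only, wetware decides" workflow Daniel describes can be sketched as a sliding-window counter. This is an illustrative reconstruction, not Daniel's actual code; the window size and threshold are invented:]

```python
from collections import defaultdict, deque
import time

# Hypothetical thresholds for illustration: more than 30 searches per
# 60-second window from one IP flags it as a *candidate* for a human
# to review. Nothing is blocked automatically.
WINDOW_SECONDS = 60
THRESHOLD = 30

_recent = defaultdict(deque)  # ip -> timestamps of requests in the window

def record_request(ip, now=None):
    """Record one search; return True if ip should be flagged for review."""
    now = time.time() if now is None else now
    q = _recent[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > THRESHOLD
```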
Re: Scroogle and Tor
On 02/14/2011 06:29 PM, scroo...@lavabit.com wrote:

> Some have wondered why anyone would want to abuse Scroogle using Tor. Apart from some malicious types that may be doing it for their own amusement, it looks to me like they are trying to datamine Google -- arguably the largest, most diverse database on the planet.

Makes a lot of sense. Actually, I can hardly blame them for wanting to mine the data. Of course, you make it pretty easily available, as you detail. I can see why this starts to present a problem.

> I spend a couple hours per day blocking abusers. A huge amount of this is done through a couple dozen monitoring programs I've written, but for the most part these programs provide candidates for blocking only, and my wetware is needed to make the final determination.

Ouch, that really sucks... time like that adds up fast.

> Now I'm seeing script writers who have solved the SSL problem. This leaves me with the user-agent, the search terms, and as a last resort, blocking Tor exit nodes. If they vary their search terms and user-agents, it can take hours to analyze patterns and accurately block them by returning a blank page. That's the way I prefer to do it, because I don't like to block Tor exit nodes. Those who are most sympathetic with what Tor is doing are also sympathetic with what Scroogle is doing. There's a lot of collateral damage associated with blocking Tor exit nodes, and I don't want to alienate the Tor community except as a last resort.

Well... Google uses the CAPTCHA system. Hard to say how well that works. I doubt anything too simple is going to work here, for many reasons, including the ones that you specify.

How about this... we know you can (mostly reliably) detect Tor exits. I think you have your goals wrong. You don't need to stop the scripts from getting to Google; even Google can't stop that on their own site. What you need is to make abusive use unprofitable on a scale that matters. Tor users care about their privacy, right... but you need a way to differentiate them. So how about a temporary registration system? I get sent to a page with a CAPTCHA (or two kinds, even). If I pass, then I get a token (set in a cookie, or put in the query string) that lets me do searches. Maybe I can set when it should expire (up to a max); maybe put in a 30-second timeout before it becomes active (slow them down some more)... maybe limit the rate of registrations per IP over time?

Secondly, have you considered poisoning their stream? If you detect an obvious abusive script, return randomized cached results. Ruining their work, rather than just slowing them down, might convince them to move on and try somewhere else. It is a thought, anyway.

> One reason why Scroogle has lasted for more than six years is that we are nonprofit, and Google knows by now that I don't tolerate abuse. My job is to stop the abuser before Scroogle passes their search terms to Google. Abusers who use Tor make this more difficult for me. Blocking an IP address is easy, but blocking Tor abusers without alienating other Tor users is more complex.

It will be sad to see Tor users lose your service (I actually had only heard the name before this thread; very curious to check it out now).

-Steve
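[Editor's note: Steve's expiring-token idea can be sketched with a stateless HMAC-signed token, so the server doesn't have to store anything per visitor. The secret, lifetime, and token format here are invented for illustration; the CAPTCHA step itself is not shown:]

```python
import hashlib
import hmac
import time

SECRET = b"rotate-me-server-side"  # hypothetical server-side secret
TOKEN_LIFETIME = 3600              # seconds; Steve suggests a capped expiry

def issue_token(now=None):
    """Issue a token after the visitor solves a CAPTCHA (not shown)."""
    expires = int((time.time() if now is None else now) + TOKEN_LIFETIME)
    sig = hmac.new(SECRET, str(expires).encode(), hashlib.sha256).hexdigest()
    return f"{expires}:{sig}"

def check_token(token, now=None):
    """True only if the token is intact and unexpired."""
    try:
        expires_s, sig = token.split(":")
    except ValueError:
        return False
    expected = hmac.new(SECRET, expires_s.encode(), hashlib.sha256).hexdigest()
    # Constant-time compare, then check the expiry timestamp.
    return hmac.compare_digest(sig, expected) and \
        int(expires_s) > (time.time() if now is None else now)
```

The token travels in a cookie or the query string, as Steve suggests; a rate limit on token issuance per IP would bound how fast a script can mint fresh ones.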
Re: Scroogle and Tor
Thus spake scroo...@lavabit.com:

> My efforts to counter abuse occasionally cause some programmers to consider using Tor to get Scroogle's results. About a year ago I began requiring any and all Tor searches at Scroogle to use SSL. Using SSL is always a good idea, but the main reason I did this is that the SSL requirement discouraged script writers who didn't know how to add this to their scripts. This policy helped immensely in cutting back on the abuse I was seeing from Tor.
>
> Now I'm seeing script writers who have solved the SSL problem. This leaves me with the user-agent, the search terms, and as a last resort, blocking Tor exit nodes. If they vary their search terms and user-agents, it can take hours to analyze patterns and accurately block them by returning a blank page. That's the way I prefer to do it, because I don't like to block Tor exit nodes. Those who are most sympathetic with what Tor is doing are also sympathetic with what Scroogle is doing. There's a lot of collateral damage associated with blocking Tor exit nodes, and I don't want to alienate the Tor community except as a last resort.

Great, now that we know the motivations of the scrapers and the history of the arms race so far, it becomes a bit easier to try to do some things to mitigate their efforts. I particularly like the idea of feeding them random, incorrect search results when you can fingerprint them.

If you want my suggestions for next steps in the arms race (having written some benevolent scrapers and web scanners myself), it would actually be to do things that require your adversary to implement and load more and more bits of a proper web browser into their crawlers for them to succeed in properly issuing queries to you. Some examples:

1. A couple of layers of crazy CSS.

If you use CSS style sheets that fetch other randomly generated and programmatically controlled style elements that are also keyed to the form submit for the search query (via an extra hidden parameter or something that is their hash), then you can verify on your server side that a given query also loaded sufficient CSS to be genuine. The problem with this is that it will mess with people who use your search plugin or search keywords, but you could also do it in a brief landing page that is displayed *after* the query, but before a 302 or meta-refresh to the actual results, for problem IPs.

2. Storing identifiers in the cache.

http://crypto.stanford.edu/sameorigin/safecachetest.html has a PoC of this. Torbutton protects against long-term cache identifiers, but for performance reasons the memory cache is enabled by default, so you could use this to differentiate crawlers that do not properly obey all browser caching semantics. Caching is actually pretty darn hard to get right, so there's probably quite a bit more room here than just plain identifiers.

3. JavaScript proof of work.

If the client supports JavaScript, you can have them factor some medium-sized integers and post the factorization with the query string, to prove some level of periodic work. The factors could be stored in cookies and given a lifetime. The obvious downside is that I bet a fair share of your users are running NoScript, or prefer to disable JS and cookies.

Anyway, thanks for your efforts with Scroogle. Hopefully the above ideas are easy enough to implement on your infrastructure to make them worth using for all problem IPs, not just Tor.

-- Mike Perry Mad Computer Scientist fscked.org evil labs
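[Editor's note: the server side of Mike's proof-of-work idea (item 3) can be sketched in a few lines. This is an illustrative reconstruction, not anything Scroogle ran; the prime sizes are arbitrary, and a real deployment would tune them so factoring costs the client a noticeable but tolerable amount of CPU:]

```python
import random

# Precompute some "medium-sized" primes by trial division.
# The 10,000-20,000 range is purely illustrative.
_SMALL_PRIMES = [p for p in range(10_000, 20_000)
                 if all(p % d for d in range(2, int(p ** 0.5) + 1))]

def issue_challenge():
    """Hand the client a semiprime to factor (e.g. embedded in the page)."""
    p, q = random.sample(_SMALL_PRIMES, 2)
    return p * q

def verify(challenge, factors):
    """Accept the query only if the client posted a nontrivial factorization."""
    a, b = factors
    return a > 1 and b > 1 and a * b == challenge
```

The client-side JavaScript would factor the number (cheap at this size, but costly when repeated per query) and post the factors back; as Mike notes, the verified factors could then be stored in a cookie with a lifetime.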
Re: Scroogle and Tor
On Mon, 14 Feb 2011 20:19:50 -0800 Mike Perry mikepe...@fscked.org wrote:

> 2. Storing identifiers in the cache.
>
> http://crypto.stanford.edu/sameorigin/safecachetest.html has a PoC of this. Torbutton protects against long-term cache identifiers, but for performance reasons the memory cache is enabled by default, so you could use this to differentiate crawlers that do not properly obey all browser caching semantics. Caching is actually pretty darn hard to get right, so there's probably quite a bit more room here than just plain identifiers.

Polipo monkey-wrenches Torbutton's protection against long-term cache identifiers.

Robert Ransom
Re: Scroogle and Tor
Thus spake Robert Ransom (rransom.8...@gmail.com):

> On Mon, 14 Feb 2011 20:19:50 -0800 Mike Perry mikepe...@fscked.org wrote:
>> 2. Storing identifiers in the cache.
>>
>> http://crypto.stanford.edu/sameorigin/safecachetest.html has a PoC of this. Torbutton protects against long-term cache identifiers, but for performance reasons the memory cache is enabled by default, so you could use this to differentiate crawlers that do not properly obey all browser caching semantics. Caching is actually pretty darn hard to get right, so there's probably quite a bit more room here than just plain identifiers.
>
> Polipo monkey-wrenches Torbutton's protection against long-term cache identifiers.

I hate Polipo. I've been trying to ignore it until it fucking dies. But it's like a zombie that just won't stop gnawing on our brains. Worse, a crack-smoking zombie that got us all addicted to it through second-hand crack smoke. Or something. But hey, it's better than Privoxy. Maybe?

I was under the impression that we hacked it to also be memory-only, though. But you're right, if I toggle Torbutton to clear my cache, Polipo's is still there...

-- Mike Perry Mad Computer Scientist fscked.org evil labs
Re: Scroogle and Tor
On Sun, Feb 13, 2011 at 2:09 PM, scroo...@lavabit.com wrote:

> [snip] I'm getting to the point where I'm tempted to offer my two exit node lists (yesterday plus today, and previous six days plus today) to the public. If I had more confidence in the lists currently available to the public, I wouldn't be tempted to do this.

You should. The current public exit service is demonstrably incorrect. Although it's also important to know why it's incorrect.

For example, one reason that the DNSEL is incorrect is a side effect of the fact that exits are tested to see what address they _really_ exit from. Sometimes an exit is placed behind some proxy, and the address it claims to be is not the address anyone else sees. But -- if an exit has a policy so narrow that it cannot be tested by this process, then it will not show up in the DNSEL results. So, e.g., if I ran a Scroogle-only exit, it wouldn't be in the DNSEL results. I'm pretty sure this is the wrong failure mode for the testing process. Though this issue means that your non-testing-based results will also be incorrect, just in another way. There may also be other issues with the DNSEL results that I am unaware of.

The daily/weekly cycle part just sounds like the pattern of nodes hitting their transfer limits and shutting off. Perhaps the DNSEL is promptly delisting these nodes when there should be a hold-up, because the DNSEL results are cached.

As far as performance goes, you can download a list of nodes which can reach a particular address at https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=1.2.3.4 but these results have the same problem with omitted nodes that I mentioned.

As far as the annoying requests from Tor go, it would be better to subject them to a CAPTCHA than to block them completely. Then again, the big reason people use Scroogle via Tor is, as I understand it, to avoid the annoying CAPTCHAs that Google often subjects Tor exits to...
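[Editor's note: fetching and parsing the bulk exit list Gregory mentions is a few lines of stdlib Python. A sketch; the URL is the one given above (substitute your server's IP for 1.2.3.4), and the line-format assumptions are based on the list being one address per line with `#` comments:]

```python
import ipaddress
import urllib.request

BULK_LIST_URL = ("https://check.torproject.org/cgi-bin/"
                 "TorBulkExitList.py?ip=1.2.3.4")  # use your server's IP

def parse_exit_list(text):
    """Return the set of exit addresses in the list, skipping comments."""
    exits = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            exits.add(str(ipaddress.ip_address(line)))
        except ValueError:
            pass  # tolerate any stray non-address lines
    return exits

def fetch_exit_list(url=BULK_LIST_URL):
    """Download the bulk exit list and parse it."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return parse_exit_list(resp.read().decode("ascii", "replace"))
```

As Gregory warns, whatever this returns still omits narrow-policy exits that the testing process cannot probe.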
Re: Scroogle and Tor
On 13/02/11 19:09, scroo...@lavabit.com wrote:

>> I've been fighting two different Tor users for a week. Each is apparently having a good time trying to see how quickly they can get results from Scroogle searches via Tor exit nodes. The fastest I've seen is about two per second. Since Tor users are only two percent of all Scroogle searches, I'm not averse to blocking all Tor exits for a while when all else fails. These two Tor users were rotating their search terms, and one also switched his user-agent once. You can see why I might be tempted to throw my block-all-Tor switch on occasion -- sometimes there's no other way to convince the bad guy that he's not going to succeed.
>
> For the less than knowledgeable people amongst us (e.g. me) who want to learn a bit more: what was the rationale for those two Tor users doing what they did? What do they get from it?
>
> Incidentally, I use the SSL version of Scroogle (sometimes with Tor, sometimes without) because a) no CAPTCHAs and b) I appreciate your privacy-minded ethos (ideology). It would be a shame if you had to block Tor users because of an abusive minority.

When a nonprofit such as the Tor Project or Scroogle offers a public service, the script kiddies should have more respect. I don't expect everyone to donate to Tor and Scroogle, but I do expect that no one will steal time and effort from us.

By the way, my block-all-Tor options for my Scroogle servers use an expanded definition of which IPs are Tor exit nodes. I pull the blutmagie.de exit node list, or the torproject.org exit node list (both port 80 and port 443), once per half hour, alternating between the two sites. One custom switch I use is a cumulative list from yesterday and today, all in one list with duplicates purged. The other switch I created is a moving cumulative list from today plus the previous six days. Why do I do this? Well, Tor's DNSEL using dig is too much overhead compared to searching a sorted list on my servers.
But the available exit node lists from the Tor directory are strange, to say the least. The list size from blutmagie.de can differ from the torproject.org list by as much as several hundred IPs, even within the same one-hour period. Moreover, they are extremely dynamic. While the current list is usually around 1100 IPs, the cumulative list from yesterday plus today is usually about 2600 unique IPs. The list from today plus the six previous days is anywhere from 4500 to 7500 unique IPs. I've been watching these numbers for over a year now -- take my word for it that what I'm describing is a consistent pattern, not some momentary fluke.

I'm getting to the point where I'm tempted to offer my two exit node lists (yesterday plus today, and previous six days plus today) to the public. If I had more confidence in the lists currently available to the public, I wouldn't be tempted to do this.

-- Daniel Brandt
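[Editor's note: the cumulative-list scheme Daniel describes (union several daily snapshots, purge duplicates, then search a sorted list instead of doing a DNSEL lookup per request) can be sketched like this. Function names are invented for illustration; note that sorting IP strings lexicographically is not numeric order, but binary search only needs a consistent ordering:]

```python
import bisect

def merge_exit_lists(daily_lists):
    """Union several daily exit-node lists, purge duplicates, return sorted."""
    merged = set()
    for day in daily_lists:
        merged.update(day)
    return sorted(merged)

def is_listed(sorted_exits, ip):
    """Binary search the merged list -- far cheaper than a dig per request."""
    i = bisect.bisect_left(sorted_exits, ip)
    return i < len(sorted_exits) and sorted_exits[i] == ip
```

The "yesterday plus today" switch is `merge_exit_lists` over two snapshots; the weekly switch is the same call over seven.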
Re: Scroogle and Tor
Gregory Maxwell wrote:

> As far as performance goes, you can download a list of nodes which can reach a particular address at https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=1.2.3.4 but these results have the same problem with omitted nodes that I mentioned.

That's the torproject.org bulk list I've been using, alternating with the blutmagie.de list. When I download the torproject.org list, I ask for exit nodes that can reach one of my servers. I alternate between asking for port 443 and port 80 on that server.

Someone else emailed me directly:

> Seems like you could get a lot smarter about this and block successive queries from the same IP that happen less than a few seconds from each other.

Difficult, because blutmagie.de and another high-traffic site account for about 20 percent of my total Tor requests. I have to exempt them from some of my screening if there's a chance of false positives. I'm already doing something like what you suggest, after exempting these two sites. It's normally turned off, but I try this first when I have a problem. I try other things too before blocking all exit nodes.

Another problem is that search-engine use presents a special challenge. Often legitimate searchers fire off a few searches in quick succession. The input box is right there, and they may modify it just slightly and fire off another search.

An extreme example of this is something I see several times a week outside of Tor (which is too slow to do this). Someone has a Scroogle search plugin out there that mimics an instant-search feature, searching on every keystroke as you key in your search term. This is something Google introduced last year, but trying to do it on Scroogle is insane. Even if it works to the user's satisfaction, I consider it extremely abusive, and I block these IPs for a week as soon as I see it happening. The reason it's insane is that Scroogle has six servers, while Google has several hundred thousand.
I wish these script kiddies would do the math first!

-- Daniel Brandt
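[Editor's note: the "successive queries from the same IP" check Daniel says he runs, with an exemption set for the two high-traffic front-ends, could look like this. A sketch only: the addresses are RFC 5737 placeholders and the minimum gap is invented:]

```python
import time

EXEMPT = {"192.0.2.10", "192.0.2.11"}  # placeholder shared front-ends
MIN_GAP = 3.0                          # illustrative seconds between searches

_last_seen = {}  # ip -> timestamp of that ip's previous search

def too_fast(ip, now=None):
    """True if ip repeats a search too quickly and is not exempt."""
    if ip in EXEMPT:
        return False
    now = time.time() if now is None else now
    prev = _last_seen.get(ip)
    _last_seen[ip] = now
    return prev is not None and (now - prev) < MIN_GAP
```

The exemption set is exactly why Daniel calls this "difficult": any shared proxy funnels many legitimate users through one address, so a per-IP gap check alone would generate false positives.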
Re: Scroogle and Tor
On Sun, 13 Feb 2011 14:09:56 -0500 (EST) scroo...@lavabit.com wrote:

> I've been fighting two different Tor users for a week. Each is apparently having a good time trying to see how quickly they can get results from Scroogle searches via Tor exit nodes.

I've talked to a few services that do one of the following:

- Run a Tor exit enclave, which would only allow exit through Tor to your webservers. There are a few services that run a Tor client and simply block every IP in the consensus, except their exit enclave.
- Run a hidden service. Due to the current state of hidden services, it'll slow down everything.
- Run a Tor exit enclave against one non-load-balanced server for Tor users. If someone abuses it, the resulting slower response times become a self-enforcing feedback loop. Of course, this sucks for the non-abusers.
- Rate-limit queries in the application: the Google solution of a CAPTCHA, or the Yahoo/Bing solution of throwing up a temporary error page when queries cross some threshold per IP address.

-- Andrew pgp 0x74ED336B
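[Editor's note: the exit-enclave option above is configured in the relay's torrc with standard `ExitPolicy` lines; a sketch, with placeholder addresses for the service's own IPs:]

```
# torrc for a relay running on (or in front of) the web service's IP.
# It exits only to the service itself, so Tor clients will tend to
# build circuits that exit at this enclave when visiting the service.
ORPort 9001
ExitPolicy accept 192.0.2.5:80
ExitPolicy accept 192.0.2.5:443
ExitPolicy reject *:*
```

Note Gregory's caveat in the next message: the first request to an enclaved host still exits at some random exit before the client learns the enclave exists.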
Re: Scroogle and Tor
On Sun, Feb 13, 2011 at 9:34 PM, Andrew Lewman and...@torproject.org wrote:

> I've talked to a few services that do one of the following:
> - Run a Tor exit enclave, which would only allow exit through Tor to your webservers. There are a few services that run a Tor client and simply block every IP in the consensus, except their exit enclave. [snip]

This one can be kind of lame, because some requests to an enclaved host (in particular, always the first one) will hit some random exit. Depending on how you do the blocking, this can give unexpected results. It would be nice if there were some roadmap for fixing this, since it really diminishes the usefulness of enclaves as a mechanism for reducing problems due to misbehaving exits. Likewise, the extra hop probably washes out a lot of the benefit of an enclave as a performance enhancement (though not as much as a hidden service does).

It can also be tricky to run an enclave when you use DNS load-balancing (especially with multiple datacenters): you must have an 'apparent' Tor node on every IP that your DNS returns.