Re: Scroogle and Tor

2011-02-15 Thread Robert Hogan
On Tuesday 15 February 2011 05:20:21 Mike Perry wrote:
 
 I was under the impression that we hacked it to also be memory-only,
 though. But you're right, if I toggle Torbutton to clear my cache,
 Polipo's is still there...

The polipo shipped in the Tor bundles has the cache turned off, but
non-Windows users will tend to use the polipo shipped by their distro,
with caching turned on.



Re: Scroogle and Tor

2011-02-14 Thread Mike Perry
Thus spake Matthew (pump...@cotse.net):

 On 13/02/11 19:09, scroo...@lavabit.com wrote:
 I've been fighting two different Tor users for a week. Each is
 apparently having a good time trying to see how quickly they
 can get results from Scroogle searches via Tor exit nodes.
 The fastest I've seen is about two per second. Since Tor users
 are only two percent of all Scroogle searches, I'm not averse
 to blocking all Tor exits for a while when all else fails.
 These two Tor users were rotating their search terms, and one
 also switched his user-agent once. You can see why I might be
 tempted to throw my "block all Tor" switch on occasion --
 sometimes there's no other way to convince the bad guy that
 he's not going to succeed.
 
 For the less-than-knowledgeable people amongst us (e.g. me) who want to 
 learn a bit more: what was the rationale for those two Tor users doing what 
 they did?  What do they get from it?

I second this.

Daniel,

If you can find a way to fingerprint these bots, my suggestion would
be to observe the types of queries they are running (perhaps for some
of their earlier runs from when you could ban them by user agent?).

One of the things Google does is actually decide your 'Captchaness'
based on the content of your queries. Well, at least I suspect that's
what they are doing, because I have been able to more reliably
reproduce Torbutton Captcha-related bugs when I try hard to write
queries like the robots that are looking for php sites to exploit.
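
For instance (a purely illustrative Python sketch; the patterns and the
threshold idea are invented here, not anything Google is known to use),
a crude query classifier might look like:

import re

# Patterns typical of exploit-hunting robots, e.g. dorks probing for
# vulnerable php pages. Purely invented examples for illustration.
ROBOTIC_PATTERNS = [
    re.compile(r'inurl:.*\.php\?'),
    re.compile(r'intitle:"index of"'),
    re.compile(r'(allinurl|filetype):'),
]

def captchaness(query):
    """Score a query by how much it reads like a scanner wrote it."""
    return sum(1 for p in ROBOTIC_PATTERNS if p.search(query))

# A score above some threshold would earn the client a Captcha.
print(captchaness('inurl:shop.php?id='))   # 1 -> suspicious
print(captchaness('weather in boston'))    # 0 -> fine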

I would love to hear more about the types of scrapers that abuse Tor.
Or rather, I would like to see if someone can at least identify
rational behavior behind scrapers that abuse Tor. Some of it could
also be misdirected malware that is operating from within Torified
browsers. Some of it could also be deliberately torified malware.

Google won't tell us any of this, obviously ;).


-- 
Mike Perry
Mad Computer Scientist
fscked.org evil labs




Re: Scroogle and Tor

2011-02-14 Thread Jim

scroo...@lavabit.com wrote:

I've been fighting two different Tor users for a week. Each is
apparently having a good time trying to see how quickly they
can get results from Scroogle searches via Tor exit nodes. 
[snip]


As the person who (recently) raised the question about the availability 
of Scroogle via Tor, I want to thank you both for running Scroogle and 
for coming on this list to explain what happened.  I also apologize to 
the list for not mentioning that Scroogle is once again available via 
Tor.  (I discovered that and meant to publish that fact approx. 24 hours 
ago.)


You are obviously much more knowledgeable about network issues than I am, 
so I will leave it to others to advise you about possible mitigations 
for your problems.  It is a real shame about the script kiddies, but 
such is the world we live in.


Jim




Re: Scroogle and Tor

2011-02-14 Thread scroogle
Some have wondered why anyone would want to abuse Scroogle
using Tor. Apart from some malicious types that may be
doing it for their own amusement, it looks to me like they
are trying to datamine Google -- arguably the largest,
most diverse database on the planet.

If you can manage to run a script 24/7 that datamines
Google, you can monetize your results. Search engine
optimizers would like to be able to do this. So would
various directory builders.

Doing it by scraping google.com directly is not easy.
Scroogle provides 100 links of organic results per
request, with less than one-half the byte-bloat that
Google delivers for the same links and snippets. It is
also much easier to parse Scroogle's simple output page
than it is to parse Google's output page.

I spend a couple hours per day blocking abusers. A huge
amount of this is done through a couple dozen monitoring
programs I've written, but for the most part these
programs provide candidates for blocking only, and
my wetware is needed to make the final determination.

My efforts to counter abuse occasionally cause some
programmers to consider using Tor to get Scroogle's
results. About a year ago I began requiring any and all
Tor searches at Scroogle to use SSL. Using SSL is always
a good idea, but the main reason I did this is that the
SSL requirement discouraged script writers who didn't
know how to add this to their scripts. This policy
helped immensely in cutting back on the abuse I was
seeing from Tor.

Now I'm seeing script writers who have solved the SSL
problem. This leaves me with the user-agent, the search
terms, and as a last resort, blocking Tor exit nodes.
If they vary their search terms and user-agents, it can
take hours to analyze patterns and accurately block them
by returning a blank page. That's the way I prefer to do
it, because I don't like to block Tor exit nodes. Those
who are most sympathetic with what Tor is doing are also
sympathetic with what Scroogle is doing. There's a lot of
collateral damage associated with blocking Tor exit nodes,
and I don't want to alienate the Tor community except as
a last resort.

One reason why Scroogle has lasted for more than six
years is that we are nonprofit, and Google knows by now
that I don't tolerate abuse. My job is to stop the abuser
before Scroogle passes their search terms to Google.
Abusers who use Tor make this more difficult for me.
Blocking an IP address is easy, but blocking Tor abusers
without alienating other Tor users is more complex.

-- Daniel Brandt





Re: Scroogle and Tor

2011-02-14 Thread thecarp
On 02/14/2011 06:29 PM, scroo...@lavabit.com wrote:
 Some have wondered why anyone would want to abuse Scroogle
 using Tor. Apart from some malicious types that may be
 doing it for their own amusement, it looks to me like they
 are trying to datamine Google -- arguably the largest,
 most diverse database on the planet.

Makes a lot of sense. Actually, I can hardly blame them for wanting to
mine the data. Of course, you make it pretty readily available, as you
detail. I can see why this starts to present a problem.
 I spend a couple hours per day blocking abusers. A huge
 amount of this is done through a couple dozen monitoring
 programs I've written, but for the most part these
 programs provide candidates for blocking only, and
 my wetware is needed to make the final determination.

Ouch, that really sucks... time like that adds up fast.

 Now I'm seeing script writers who have solved the SSL
 problem. This leaves me with the user-agent, the search
 terms, and as a last resort, blocking Tor exit nodes.
 If they vary their search terms and user-agents, it can
 take hours to analyze patterns and accurately block them
 by returning a blank page. That's the way I prefer to do
 it, because I don't like to block Tor exit nodes. Those
 who are most sympathetic with what Tor is doing are also
 sympathetic with what Scroogle is doing. There's a lot of
 collateral damage associated with blocking Tor exit nodes,
 and I don't want to alienate the Tor community except as
 a last resort.


Well... Google uses the CAPTCHA system. Hard to say how well that works.
I doubt anything too simple is going to work here, for many reasons,
including the ones that you specify. How about this: we know you can
(mostly reliably) detect Tor exits.

I think you have your goals wrong. You don't need to stop the scripts
from getting to Google; even Google can't stop that on their own site.
What you need is to make abusive use unprofitable on a scale that matters.

Tor users care about their privacy, right? But you need a way to
differentiate them. So how about a temporary registration system? I get
sent to a page with a CAPTCHA (or even two kinds). If I pass, then I get
a token (set in a cookie, or put in the query string) that lets me do
searches. Maybe I can set when it should expire (up to a maximum), and
maybe there's a 30-second timeout before it becomes active, to slow them
down some more. Maybe also limit the registration rate per IP over time?
A sketch of the token part is below.
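
Something like this, in Python (the lifetime cap and the 30-second
activation delay are the knobs described above; the secret and every
concrete value are made-up placeholders):

import hashlib, hmac, time

SECRET = b'server-side secret'     # placeholder; generate per deployment
MAX_LIFETIME = 24 * 3600           # cap on user-chosen expiry, in seconds
ACTIVATION_DELAY = 30              # token not usable right away

def issue_token(lifetime):
    """Called once the CAPTCHA is solved; returns an opaque token."""
    issued = int(time.time())
    expires = issued + min(lifetime, MAX_LIFETIME)
    payload = f'{issued}:{expires}'.encode()
    mac = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f'{issued}:{expires}:{mac}'

def check_token(token):
    """True if the token is genuine, already active, and unexpired."""
    try:
        issued, expires, mac = token.rsplit(':', 2)
        payload = f'{issued}:{expires}'.encode()
        expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(mac, expected):
            return False
        now = time.time()
        return int(issued) + ACTIVATION_DELAY <= now < int(expires)
    except ValueError:
        return False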

Secondly, have you considered poisoning their stream? If you detect an
obviously abusive script, return randomized cached results. Ruining their
work, rather than just slowing them down, might convince them to move on
and try somewhere else. It's a thought, anyway.
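
As a sketch (Python; looks_abusive stands in for whatever fingerprinting
the operator already does, and the cached pages are placeholders):

import random

CACHED_RESULTS = ['<html>stale results A</html>',
                  '<html>stale results B</html>']

def looks_abusive(request):
    # Stand-in for the operator's existing fingerprinting.
    return request.get('score', 0) > 3

def respond(request, real_search):
    """Serve real results to humans, shuffled stale ones to bots."""
    if looks_abusive(request):
        # Plausible-looking but wrong: ruins the scraper's dataset.
        return random.choice(CACHED_RESULTS)
    return real_search(request)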

 One reason why Scroogle has lasted for more than six
 years is that we are nonprofit, and Google knows by now
 that I don't tolerate abuse. My job is to stop the abuser
 before Scroogle passes their search terms to Google.
 Abusers who use Tor make this more difficult for me.
 Blocking an IP address is easy, but blocking Tor abusers
 without alienating other Tor users is more complex.

It will be sad to see Tor users lose your service (I had actually only
heard the name before this thread; I'm very curious to check it out now).

-Steve



Re: Scroogle and Tor

2011-02-14 Thread Mike Perry
Thus spake scroo...@lavabit.com (scroo...@lavabit.com):

 My efforts to counter abuse occasionally cause some
 programmers to consider using Tor to get Scroogle's
 results. About a year ago I began requiring any and all
 Tor searches at Scroogle to use SSL. Using SSL is always
 a good idea, but the main reason I did this is that the
 SSL requirement discouraged script writers who didn't
 know how to add this to their scripts. This policy
 helped immensely in cutting back on the abuse I was
 seeing from Tor.
 
 Now I'm seeing script writers who have solved the SSL
 problem. This leaves me with the user-agent, the search
 terms, and as a last resort, blocking Tor exit nodes.
 If they vary their search terms and user-agents, it can
 take hours to analyze patterns and accurately block them
 by returning a blank page. That's the way I prefer to do
 it, because I don't like to block Tor exit nodes. Those
 who are most sympathetic with what Tor is doing are also
 sympathetic with what Scroogle is doing. There's a lot of
 collateral damage associated with blocking Tor exit nodes,
 and I don't want to alienate the Tor community except as
 a last resort.

Great, now that we know the motivations of the scrapers and a history
of the arms race so far, it becomes a bit easier to try to do some
things to mitigate their efforts. I particularly like the idea of
feeding them random, incorrect search results when you can fingerprint
them.


If you want my suggestions for next steps in this arms race (having
written some benevolent scrapers and web scanners myself), they would
be to do things that require your adversary to implement and load more
and more bits of a proper web browser into their crawlers before they
can succeed in properly issuing queries to you.

Some examples:

1. A couple layers of crazy CSS.

If you use CSS style sheets that fetch other randomly generated and
programmatically controlled style elements that are also keyed to the
form submit for the search query (via an extra hidden parameter or
something that is their hash), then you can verify on your server side
that a given query also loaded sufficient CSS to be genuine. 

The problem with this is that it will mess with people who use your
search plugin or search keywords, but you could also do it in a brief
landing page that is displayed *after* the query, but before a 302 or
meta-refresh to the actual results, for problem IPs.
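
A sketch of the server side of idea 1 (Python; it assumes the query form
carries a hidden per-request nonce and the stylesheet URL embeds a hash
of it, which is one way to realize the "extra hidden parameter" above):

import hashlib, os

pending = {}   # nonce -> whether the matching CSS was fetched

def render_form():
    """Embed a fresh nonce in the form and key the stylesheet URL to it."""
    nonce = os.urandom(8).hex()
    pending[nonce] = False
    css_key = hashlib.sha256(nonce.encode()).hexdigest()[:16]
    return ('<link rel="stylesheet" href="/css/%s.css">'
            '<input type="hidden" name="n" value="%s">' % (css_key, nonce))

def on_css_fetch(css_key):
    """Handler for /css/<key>.css: mark the matching nonce as verified."""
    for nonce in pending:
        if hashlib.sha256(nonce.encode()).hexdigest()[:16] == css_key:
            pending[nonce] = True

def query_is_genuine(nonce):
    """A real browser fetched the keyed CSS; a bare script did not."""
    return pending.pop(nonce, False)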

2. Storing identifiers in the cache

http://crypto.stanford.edu/sameorigin/safecachetest.html has some PoC
of this. Torbutton protects against long-term cache identifiers, but
for performance reasons the memory cache is enabled by default, so you
could use this to differentiate crawlers who do not properly obey all
browser caching semantics. Caching is actually pretty darn hard to get
right, so there's probably quite a bit more room here than just plain
identifiers.
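
A bare-bones version of the trick (Python sketch; the probe URL and
max-age are invented): serve a small cacheable resource whose body is a
random identifier, and watch whether the client ever re-requests it.

import os

def serve_cache_probe():
    """Handler for a probe URL, e.g. /probe.js, marked cacheable."""
    ident = os.urandom(8).hex()
    headers = {'Cache-Control': 'max-age=3600'}
    # A client honoring caching semantics keeps this body for an hour
    # and can echo the identifier back with later queries. A crawler
    # that ignores caching re-fetches and presents a NEW identifier
    # every time, which is itself a tell.
    body = 'var cacheId = "%s";' % ident
    return headers, body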

3. Javascript proof of work

If the client supports javascript, you can have them factor some
medium-sized integers and post the factorization with the query
string, to prove some level of periodic work. The factors could be
stored in cookies and given a lifetime. The obvious downside of this
is that I bet a fair share of your users are running NoScript, or
prefer to disable js and cookies.
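
Server side, that might look like this (Python sketch; the primes,
challenge lifetime, and bookkeeping are all invented, and the client-side
JS would do the trial division and post p and q back):

import random, time

# A few medium-sized primes; a real deployment would generate them.
PRIMES = [104729, 130363, 198491, 611953, 999983]

challenges = {}   # n -> deadline for answering

def make_challenge():
    """Hand the client a semiprime to factor in javascript."""
    n = random.choice(PRIMES) * random.choice(PRIMES)
    challenges[n] = time.time() + 600      # ten minutes to answer
    return n

def check_answer(n, p, q):
    """Accept the query only with a valid, timely factorization."""
    if challenges.pop(n, 0) < time.time():
        return False                       # unknown or expired challenge
    return p > 1 and q > 1 and p * q == n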


Anyways, thanks for your efforts with Scroogle. Hopefully the above
ideas are actually easy enough to implement on your infrastructure to
make it worth your while to use for all problem IPs, not just Tor.

-- 
Mike Perry
Mad Computer Scientist
fscked.org evil labs




Re: Scroogle and Tor

2011-02-14 Thread Robert Ransom
On Mon, 14 Feb 2011 20:19:50 -0800
Mike Perry mikepe...@fscked.org wrote:

 2. Storing identifiers in the cache
 
 http://crypto.stanford.edu/sameorigin/safecachetest.html has some PoC
 of this. Torbutton protects against long-term cache identifiers, but
 for performance reasons the memory cache is enabled by default, so you
 could use this to differentiate crawlers who do not properly obey all
 browser caching semantics. Caching is actually pretty darn hard to get
 right, so there's probably quite a bit more room here than just plain
 identifiers.

Polipo monkey-wrenches Torbutton's protection against long-term cache
identifiers.


Robert Ransom




Re: Scroogle and Tor

2011-02-14 Thread Mike Perry
Thus spake Robert Ransom (rransom.8...@gmail.com):

 On Mon, 14 Feb 2011 20:19:50 -0800
 Mike Perry mikepe...@fscked.org wrote:
 
  2. Storing identifiers in the cache
  
  http://crypto.stanford.edu/sameorigin/safecachetest.html has some PoC
  of this. Torbutton protects against long-term cache identifiers, but
  for performance reasons the memory cache is enabled by default, so you
  could use this to differentiate crawlers who do not properly obey all
  browser caching semantics. Caching is actually pretty darn hard to get
  right, so there's probably quite a bit more room here than just plain
  identifiers.
 
 Polipo monkey-wrenches Torbutton's protection against long-term cache
 identifiers.

I hate polipo. I've been trying to ignore it until it fucking dies. But
it's like a zombie that just won't stop gnawing on our brains. Worse,
a crack smoking zombie that got us all addicted to it through second
hand crack smoke. Or something. But hey, it's better than privoxy.
Maybe?

I was under the impression that we hacked it to also be memory-only,
though. But you're right, if I toggle Torbutton to clear my cache,
Polipo's is still there...


-- 
Mike Perry
Mad Computer Scientist
fscked.org evil labs




Re: Scroogle and Tor

2011-02-13 Thread Gregory Maxwell
On Sun, Feb 13, 2011 at 2:09 PM,  scroo...@lavabit.com wrote:
[snip]
 I'm getting to the point where I'm tempted to offer my two
 exit node lists (yesterday plus today, and previous six days
 plus today) to the public. If I had more confidence in the
 lists currently available to the public, I wouldn't be
 tempted to do this.

You should. The current public exit service is demonstrably incorrect.

Although it's also important to know why it's incorrect.

For example, one reason that the DNSEL is incorrect is a side effect
of the fact that exits are tested to see what address they _really_
exit from. Sometimes an exit is placed behind some proxy, and the
address that it claims to be is not the address anyone else sees.
But if an exit has a policy so narrow that it cannot be tested by
this process, then it will not show up in the DNSEL results.

So, e.g., if I ran a Scroogle-only exit, it wouldn't be in the DNSEL
results.  I'm pretty sure this is the wrong failure mode for the
testing process.

Though this issue means that your non-testing-based results will also
be incorrect, just in a different way.

There may also be other issues with the DNSEL results which I am
unaware of. The daily/weekly cycle part just sounds like the pattern
of nodes hitting their transfer limits and shutting off.  Perhaps the
DNSEL is promptly delisting these nodes when there should be a hold-up,
because the DNSEL results are cached.

As far as performance goes, you can download a list of nodes which can
reach a particular address at
https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=1.2.3.4
but these results have the same problem with omitted nodes that I
mentioned.
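
For example (Python sketch; substitute your server's real address for
1.2.3.4):

from urllib.request import urlopen

url = ('https://check.torproject.org/cgi-bin/'
       'TorBulkExitList.py?ip=1.2.3.4')
with urlopen(url) as resp:
    exits = {line.strip() for line in resp.read().decode().splitlines()
             if line and not line.startswith('#')}
print(len(exits), 'exit addresses can reach 1.2.3.4')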

As far as the annoying requests from Tor go, it would be better to
subject them to a CAPTCHA than to block them completely. Then again,
the big reason people use Scroogle via Tor is, as I understand it, to
avoid the annoying CAPTCHAs that Google often subjects Tor exits to...


Re: Scroogle and Tor

2011-02-13 Thread Matthew



On 13/02/11 19:09, scroo...@lavabit.com wrote:

I've been fighting two different Tor users for a week. Each is
apparently having a good time trying to see how quickly they
can get results from Scroogle searches via Tor exit nodes.
The fastest I've seen is about two per second. Since Tor users
are only two percent of all Scroogle searches, I'm not averse
to blocking all Tor exits for a while when all else fails.
These two Tor users were rotating their search terms, and one
also switched his user-agent once. You can see why I might be
tempted to throw my "block all Tor" switch on occasion --
sometimes there's no other way to convince the bad guy that
he's not going to succeed.



For the less-than-knowledgeable people amongst us (e.g. me) who want to 
learn a bit more: what was the rationale for those two Tor users doing what 
they did?  What do they get from it?


Incidentally, I use the SSL version of Scroogle (sometimes with Tor,
sometimes without) because a) there are no CAPTCHAs, and b) I appreciate
your privacy-minded ethos (ideology).  It would be a shame if you had to
block Tor users because of an abusive minority.


Re: Scroogle and Tor

2011-02-13 Thread scroogle

When a nonprofit such as the Tor Project or Scroogle offers a
public service, the script kiddies should have more respect.
I don't expect everyone to donate to Tor and Scroogle, but I
do expect that no one will steal time and effort from us.

By the way, my "block all Tor" options for my Scroogle servers
use an expanded definition of which IPs are Tor exit nodes.
I pull the blutmagie.de exit node list, or the torproject.org
exit node list (both port 80 and port 443), once per half hour,
alternating between the two sites.

One custom switch I use is a cumulative list from yesterday and
today, all in one list with duplicates purged. The other switch
I created is a moving cumulative list from today plus the
previous six days.

Why do I do this? Well, querying Tor's DNSEL with dig is too much
overhead, compared to searching a sorted list on my servers.
But the available exit node lists from the Tor directory are
strange, to say the least. The list size from blutmagie.de can
be as much as several hundred IPs different from the list from
torproject.org, even within the same one-hour period. Moreover,
they are extremely dynamic. While the current list is usually
around 1100 IPs, the cumulative list from yesterday plus today
is usually about 2600 unique IPs. The list from today plus the
six previous days is anywhere from 4500 to 7500 unique IPs.
I've been watching these numbers for over a year now -- take
my word for it that what I'm describing is a consistent
pattern, not some momentary fluke.
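
The mechanics of those cumulative lists are simple enough (Python
sketch; the file names are invented): merge the day files with
duplicates purged, keep the result sorted, and test each client
against it with a binary search instead of a dig to the DNSEL.

import bisect

def load_cumulative(paths):
    """Merge several daily exit-node lists; duplicates purged, sorted."""
    ips = set()
    for path in paths:
        with open(path) as f:
            ips.update(line.strip() for line in f if line.strip())
    return sorted(ips)

def is_tor_exit(sorted_ips, addr):
    """Binary search; far cheaper per request than an external lookup."""
    i = bisect.bisect_left(sorted_ips, addr)
    return i < len(sorted_ips) and sorted_ips[i] == addr

# The "yesterday plus today" switch would be something like:
# exits = load_cumulative(['exits-0212.txt', 'exits-0213.txt'])
# blocked = is_tor_exit(exits, client_ip)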

I'm getting to the point where I'm tempted to offer my two
exit node lists (yesterday plus today, and previous six days
plus today) to the public. If I had more confidence in the
lists currently available to the public, I wouldn't be
tempted to do this.

-- Daniel Brandt





Re: Scroogle and Tor

2011-02-13 Thread scroogle
 Gregory Maxwell wrote:

 As far as performance goes, you can download a list of nodes which can
 reach a particular address at
 https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=1.2.3.4
 but these results have the same problem with omitted nodes that I
 mentioned.

That's the torproject.org bulk list I've been using, alternating with
the blutmagie.de list. When I download the torproject.org list I ask
for exit nodes that can reach one of my servers. I alternate between
asking for port 443 and port 80 on that server.

 Someone else emailed me directly:

 Seems like you could get a lot smarter about this and block successive
 queries from the same IP that happen less than a few seconds from each
 other.

Difficult, because blutmagie.de and another high-traffic site account for
about 20 percent of my total Tor requests. I have to exempt them from some
of my screening if there's a chance of false positives. I'm already doing
something like what you suggest, after exempting these two sites. It's
normally turned off, but I try this first when I have a problem. I try
other things too before blocking all exit nodes.
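
That kind of screening might be sketched like so (Python; the exempt
addresses, window, and limit are placeholders, not Scroogle's actual
settings):

import time
from collections import defaultdict, deque

EXEMPT = {'192.0.2.10', '192.0.2.20'}   # the two high-traffic sites
WINDOW, LIMIT = 10.0, 5                 # max 5 searches per 10 seconds

recent = defaultdict(deque)             # ip -> timestamps of recent queries

def allow(ip):
    """False once an IP exceeds the rate limit; exempt IPs always pass."""
    if ip in EXEMPT:
        return True
    now = time.time()
    q = recent[ip]
    while q and now - q[0] > WINDOW:    # drop timestamps outside the window
        q.popleft()
    if len(q) >= LIMIT:
        return False                    # candidate for a blank page
    q.append(now)
    return True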

Another problem is that search-engine use presents a special challenge.
Often legitimate searchers fire off a few searches in quick succession.
The input box is right there, and they may modify it just slightly and
fire off another search.

An extreme example of this is something I see several times a week
outside of Tor (which is too slow to do this). Someone has a Scroogle
search plugin out there that mimics an instant-search feature for every
keystroke as you key in your search term. This is something Google
introduced last year. But trying to do this on Scroogle is insane.
Even if it works to the user's satisfaction, I consider this extremely
abusive, and I block these IPs for a week as soon as I see it happening.
The reason it's insane is that Scroogle has six servers, while Google
has several hundred thousand servers. I wish these script kiddies would
do the math first!

-- Daniel Brandt





Re: Scroogle and Tor

2011-02-13 Thread Andrew Lewman
On Sun, 13 Feb 2011 14:09:56 -0500 (EST)
scroo...@lavabit.com wrote:

 I've been fighting two different Tor users for a week. Each is
 apparently having a good time trying to see how quickly they
 can get results from Scroogle searches via Tor exit nodes.

I've talked to a few services that do one of the following:

- Run a Tor exit enclave, which would only allow exit through Tor to
  your webservers.  There are a few services that run a Tor client and
  simply block every IP in the consensus, except their exit enclave.

- Run a hidden service.  Due to the current state of hidden services,
  it'll slow down everything.

- Run a Tor exit enclave against one non-load-balanced server for Tor
  users. If someone abuses it, the reality of slower response times is a
  self-reinforcing feedback loop. Of course, this sucks for the
  non-abusers.

- Rate limiting queries in the application: the Google solution of a
  CAPTCHA, or the Yahoo/Bing solution of throwing up a temporary error
  page when queries cross some threshold per IP address.

-- 
Andrew
pgp 0x74ED336B


Re: Scroogle and Tor

2011-02-13 Thread Gregory Maxwell
On Sun, Feb 13, 2011 at 9:34 PM, Andrew Lewman and...@torproject.org wrote:
 I've talked to a few services that do one of the following:

 - Run a Tor exit enclave, which would only allow exit through Tor to
  your webservers.  There are a few services that run a Tor client and
  simply block every IP in the consensus, except their exit enclave.
[snip]

This one can be kind of lame, because some requests to an enclaved
host (in particular, always the first one) will hit some random exit.
Depending on how you do the blocking, this can give unexpected results.

It would be nice if there were some roadmap to fixing this, since it
really diminishes the usefulness of enclaves as a mechanism for
reducing problems due to misbehaving exits. Likewise, the extra hop
probably washes out a lot of the benefit of an enclave as a
performance enhancement (though not as much as a hidden service).

It can also be tricky to run an enclave when you use DNS load balancing
(especially with multiple datacenters): you must have an 'apparent'
Tor node on every IP that your DNS returns.