Re: How to Run High Capacity Tor Relays

2010-09-01 Thread John Case



Also, afaik, zero people in the wild are actively running Tor with any
crypto accelerator. May be a very painful process... I'm not really
interested in documenting it unless its proven to scale by actual use.
I want this document to end up with tested and reproduced results
only. You know, Science. Not computerscience ;)



There was a _very_ interesting, long and detailed discussion of this about 
1 year ago on this list.


I really do think some subset of that discussion should be included in 
your lore, at the very least the parts pertaining to the built-in crypto 
acceleration included in recent sparc CPUs, which appear to be the only 
non-painful way to make this work.


My impression was that a significant boost could be had by accelerating 
openssl using these on-chip features...

***
To unsubscribe, send an e-mail to majord...@torproject.org with
unsubscribe or-talk in the body. http://archives.seul.org/or/talk/


Re: How to Run High Capacity Tor Relays

2010-09-01 Thread coderman
On Wed, Sep 1, 2010 at 2:28 PM, John Case c...@sdf.lonestar.org wrote:
...
 I really do think some subset of that discussion should be included in your
 lore, at the very least the parts pertaining to the built-in crypto
 acceleration included in recent sparc CPUs, which appear to be the only
 non-painful way to make this work.

if you're running a high capacity relay you likely don't need hw
acceleration because:

a. you're on a fast server with a relatively modern processor to be in
the high capacity game. assembly-optimized crypto is pretty fast on
these systems.

b. the compression, buffer management, and other aspects of Tor are
just as significant as the crypto specific parts on such a server.

c. the crypto hw needed to be effective is expensive, at least a
grand, or inside specialized server processors you're unlikely to have
in your dedicated / leased server hardware.


this is not to say it isn't useful. it's useful in all kinds of ways
ranging from efficiency improvements, side channel attack resistance,
to entropy sources for strong session key / nonce generation.

however, i doubt hardware crypto will prove useful for anyone in the
top tier of relay capacity to drastically improve their throughput or
efficiency overall given the current architecture of Tor itself.

and, as mentioned, there have been a number of threads on the subject,
and widely expanded OpenSSL engine support added since last year for
those interested in experimenting with hw acceleration.
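for those who want to experiment, a minimal torrc sketch (my example,
not from this thread; HardwareAccel is the stock torrc option, but
whether an engine actually gets used depends on how your OpenSSL was
built):

```
# torrc fragment: ask Tor to use OpenSSL hardware/engine acceleration
# if one is available. harmless no-op when no engine is present.
HardwareAccel 1
```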

best regards,


Re: How to Run High Capacity Tor Relays

2010-09-01 Thread Jacob Appelbaum
On 09/01/2010 02:28 PM, John Case wrote:
 
 Also, afaik, zero people in the wild are actively running Tor with any
 crypto accelerator. May be a very painful process... I'm not really
 interested in documenting it unless its proven to scale by actual use.
 I want this document to end up with tested and reproduced results
 only. You know, Science. Not computerscience ;)
 
 
 There was a _very_ interesting, long and detailed discussion of this
 about 1 year ago on this list.
 
 I really do think some subset of that discussion should be included in
 your lore, at the very least the parts pertaining to the built-in
 crypto acceleration included in recent sparc CPUs, which appear to be
 the only non-painful way to make this work.
 
 My impression was that a significant boost could be had by accelerating
 openssl using this on-chip features...

If you're using a fast CPU, it's almost not worth the trouble to bother
with hardware acceleration.

All the best,
Jacob


Re: How to Run High Capacity Tor Relays

2010-08-25 Thread Mike Perry
I should have said this in my first post, but I believe that all
subsequent replies should go to tor-relays. This should be the last
post discussing technical details of relay operation on or-talk.


Thus spake coderman (coder...@gmail.com):

  net.ipv4.tcp_keepalive_time = 1200
 
 ^- who uses keepalive? :)

Hrmm, Tor does its own application-level keepalive. Perhaps that's how
this got merged in by confusion. Or maybe, like many of these, it was
just a blanket cut+and+paste move out of desperation to try to
increase capacity. The whole superset of voodoo thing.
 
  net.netfilter.nf_conntrack_tcp_timeout_established=7200
  net.netfilter.nf_conntrack_checksum=0
  net.netfilter.nf_conntrack_max=131072
  net.netfilter.nf_conntrack_tcp_timeout_syn_sent=15
 
 ^- best to just disable conntrack altogether if you can. -J NOTRACK in
 the raw table as appropriate.
 you're going to eat up lots of memory with a decent nf|ip_conntrack_max
 ( check /proc/sys/net/ipv4/netfilter/ip_conntrack_max , etc )

Will this remove the ability to do PREROUTING DNAT rules? I know a lot
of Tor nodes forward ports and even IPs around.

Good suggestion though. Perhaps we should mention both options in the
final draft.

  [...]
 some dupes in here?
 
  net.ipv4.ip_forward=1
  ...
  net.ipv4.conf.default.forwarding=1
  net.ipv4.conf.default.proxy_arp = 1
 
 ^- BAD! this should not be enabled by default unless you're actually
 routing specifically to guest vm's or between interfaces or something.
 if you enable forwarding by default, someone may use you to relay some
 malicious traffic.

Oh shit, that is a relic of Moritz's config. He is also planning to
provide VPN and VPS services. Good catch.

Also, does DNAT count as forwarding for the ip_forward option? 
 
  == Did I leave anything out? ==
 
  Well, did I?
 
 i'd love to see an sca6000 accelerated node.  been working with these
 recently but unfortunately they're allocated for other work...
 (most of the other crypto hw is going to be bus / implementation
 limited to less than what a beefy 64bit modern server can provide, so
 of little utility in this context.)

I'd love to hear Roger and Nick's comments on this, but isn't it
possible this might also bottleneck well before 1Gbit? I am worried it
may depend largely on the architecture of the card and our use of
openssl. Their docs claim up to 1Gbit, but this could be using highly
parallelized processing, which Tor cannot really do, as I understand
it.

Personally I think the hyperthreading option is the lowest hanging
fruit for maxing out a single Tor relay process for lowest cost.

Also, afaik, zero people in the wild are actively running Tor with any
crypto accelerator. May be a very painful process... I'm not really
interested in documenting it unless its proven to scale by actual use.
I want this document to end up with tested and reproduced results
only. You know, Science. Not computerscience ;)


-- 
Mike Perry
Mad Computer Scientist
fscked.org evil labs




How to Run High Capacity Tor Relays

2010-08-24 Thread Mike Perry
After talking to Moritz and Olaf privately and asking them about their
nodes, and after running some experiments with some high capacity
relays, I've begun to realize that running a fast Tor relay is a
pretty black art, with a lot of ad-hoc practice. Only a few people
know how to do it, and if you just use Linux and Tor out of the box,
your relay will likely underperform on 100Mbit links and above.

In the interest of trying to help grow and distribute the network, my
ultimate plan is to try to collect all of this lore, use Science to
divine out what actually matters, and then write a more succinct blog
post about it.

However, that is a lot of work. It's also not totally necessary to do
all this work, when you can get a pretty good setup with a rough
superset of all of the ad-hoc voodoo. This post is thus about that
voodoo.

Hopefully others will spring forth from the darkness to dump their own
voodoo in this thread, as I suspect there is one hell of a lot of it
out there, some (much?) of which I don't yet know. Likewise, if any
blasphemous heretic wishes to apply Science to this voodoo, they
should yell out, "Stand Back, I'm Doing Science!" (at home please, not
on this list) and run some experiments to try to eliminate options
that are useless to Tor performance. Or cite academic research papers.
(But that's not Science, that's computerscience - which is a religion
like voodoo, but with cathedrals).

Anyway, on with the draft:


== Machine Specs ==

First, you want to run your OS in x64 mode because openssl should do
crypto faster in 64bit.  

Tor is currently not fully multithreaded, and tends not to benefit
beyond 2 cores per process. Even then, the benefit is still marginal
beyond just 1 core. 64bit Tor nodes require about one 2GHz Xeon/Core2
core per 100Mbit of capacity.

Thus, to fill an 800Mbit link, you need at least a dual socket, quad
core cpu config.  You may be able to squeeze a full gigabit out of one
of these machines. As far as I know, no one has ever done this with
Tor, on any one machine.

The i7's also just came out in this form factor, and can do
hyperthreading (previous models may list 'ht' in cpuinfo, but actually
don't support it). This should give you a decent bonus if you set
NumCPUs to 2, since ht tends to work better with pure integer math
(like crypto). We have not benchmarked this config yet, though I
suspect it should fill a gigabit link fairly easily, possibly
approaching 2Gbit.
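For reference, the relevant torrc line is just (NumCPUs is the stock
option name; 2 matches the hyperthreading suggestion above):

```
# torrc fragment: let Tor use two cpu workers for crypto
NumCPUs 2
```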

At full capacity, exit node Tor processes running at this rate consume
about 500M of ram. You want to ensure your ram speed is sufficient,
but most newish hardware is good. Using this chart:
https://secure.wikimedia.org/wikipedia/en/wiki/List_of_device_bandwidths#Memory_Interconnect.2FRAM_buses
you can do the math and see that with a dozen memcpys in each
direction, you come out needing DDR2 to be able to push 1Gbit full
duplex.
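A rough sketch of that math (my numbers, purely illustrative: assuming
~12 memcpys per direction, and 1 Gbit/s = 125 MB/s of line rate each
way):

```shell
# 2 directions * ~12 copies each * 125 MB/s per direction of line rate
echo $(( 2 * 12 * 125 ))    # MB/s of memcpy traffic at 1Gbit full duplex
```

That works out to 3000 MB/s, comfortably inside DDR2-800's theoretical
~6400 MB/s peak, but uncomfortably close to older DDR's limits.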

As far as ethernet cards, the Intel e1000e *should* be theoretically
good, but it seems to fail at properly balancing IRQs across multiple
CPUs on recent kernels, which can cause you to bottleneck at 100% CPU
on one core. At least that has been Moritz's experience. In our
experiments, the RTL-8169 works fine (once tweaked, see below).


== System Tweakscript Wibbles and Config Splatters ==

First, you want to ensure that you run no more than 2 Tor instances
per IP. Any more than this and clients will ignore them.
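A sketch of what two instances on one IP might look like (file paths
and port numbers are mine, purely illustrative; each instance needs
its own ORPort and DataDirectory):

```shell
# run two relay instances from separate configs
tor -f /etc/tor/torrc1   # torrc1: ORPort 443,  DataDirectory /var/lib/tor1
tor -f /etc/tor/torrc2   # torrc2: ORPort 9001, DataDirectory /var/lib/tor2
```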

Next, paste the following smattering into the shell (or just read it
and make your own script):

# Set the hard limit of open file descriptors really high.
# Tor will also potentially run out of ports.
ulimit -SHn 65000

# Set the txqueuelen high, to prevent premature drops
ifconfig eth0 txqueuelen 2

# Tell our ethernet card (interrupt found from /proc/interrupts)
# to balance its IRQs across one whole CPU socket (4 cpus, mask 0f).
# You only want one socket for optimal ISR and buffer caching.
#
# Note that e1000e does NOT seem to obey this, but RTL-8169 will.
echo 0f > /proc/irq/17/smp_affinity

# Make sure you have auxiliary nameservers. I've seen many ISP
# nameservers fall over under load from fast tor nodes, both on our
# nodes and from scans. Or run caching named and closely monitor it.
echo nameserver 8.8.8.8 >> /etc/resolv.conf
echo nameserver 4.2.2.2 >> /etc/resolv.conf

# Load an amalgam of gigabit-tuning sysctls from:
# http://datatag.web.cern.ch/datatag/howto/tcp.html
# http://fasterdata.es.net/TCP-tuning/linux.html
# http://www.acc.umu.se/~maswan/linux-netperf.txt
# http://www.psc.edu/networking/projects/tcptune/#Linux
# and elsewhere...
# We have no idea which of these are needed yet for our actual use
# case, but they do help (especially the nf-contrack ones):
sysctl -p - <<EOF
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.rmem_max = 1048575
net.core.wmem_max = 1048575
net.ipv4.ip_local_port_range = 1025 61000

Re: How to Run High Capacity Tor Relays

2010-08-24 Thread coderman
On Tue, Aug 24, 2010 at 8:27 AM, Mike Perry mikepe...@fscked.org wrote:
 ...
 # Set the hard limit of open file descriptors really high.
 # Tor will also potentially run out of ports.
 ulimit -SHn 65000

typically in /etc/security/limits.conf. i like to append:
*   soft    nofile  4096
*   hard    nofile  65535

but on big servers use a quarter million as the hard limit. (Tor is
not this fd-hungry; 64k is fine.)


 # Load an amalgam of gigabit-tuning sysctls from:
 ...
 # We have no idea which of these are needed yet for our actual use
 # case, but they do help (especially the nf-contrack ones):

you probably want to save these in /etc/sysctl.conf , then run sysctl -p


 ...
 net.ipv4.tcp_rmem = 4096 87380 16777216
 net.ipv4.tcp_wmem = 4096 65536 16777216
 net.core.netdev_max_backlog = 2500
 net.ipv4.tcp_no_metrics_save = 1
 net.ipv4.tcp_moderate_rcvbuf = 1
 net.core.rmem_max = 1048575
 net.core.wmem_max = 1048575

^- these are important and useful



 net.ipv4.ip_local_port_range = 1025 61000

^- that's a little aggressive, better to set FIN timeout lower. i like
5000 to 65535 ephemeral port range
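as a sketch, that suggestion would look like (values taken from the
comments above; run as root):

```shell
# wider ephemeral port range, shorter FIN timeout for a busy relay
sysctl -w net.ipv4.ip_local_port_range="5000 65535"
sysctl -w net.ipv4.tcp_fin_timeout=4
```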


 net.ipv4.tcp_max_syn_backlog = 10240
 net.ipv4.tcp_fin_timeout = 30

^- i like a fin timeout of 3-4 seconds on a busy server, otherwise
you've got lots of resources tied up in sockets waiting to die...  Tor
is not quite so volatile as some services, so perhaps 30 is fine.


 net.ipv4.tcp_keepalive_time = 1200

^- who uses keepalive? :)


 net.netfilter.nf_conntrack_tcp_timeout_established=7200
 net.netfilter.nf_conntrack_checksum=0
 net.netfilter.nf_conntrack_max=131072
 net.netfilter.nf_conntrack_tcp_timeout_syn_sent=15

^- best to just disable conntrack altogether if you can. -J NOTRACK in
the raw table as appropriate.
you're going to eat up lots of memory with a decent nf|ip_conntrack_max
( check /proc/sys/net/ipv4/netfilter/ip_conntrack_max , etc )
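a sketch of the NOTRACK approach (9001 is the conventional ORPort;
substitute your relay's actual ports):

```shell
# skip connection tracking for relay traffic, in the raw table
iptables -t raw -A PREROUTING -p tcp --dport 9001 -j NOTRACK
iptables -t raw -A OUTPUT     -p tcp --sport 9001 -j NOTRACK
```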


 [...]
some dupes in here?

 net.ipv4.ip_forward=1
 ...
 net.ipv4.conf.default.forwarding=1
 net.ipv4.conf.default.proxy_arp = 1

^- BAD! this should not be enabled by default unless you're actually
routing specifically to guest vm's or between interfaces or something.
if you enable forwarding by default, someone may use you to relay some
malicious traffic.

were these cut and paste errors?  remember to disable forwarding
first, before tuning other parameters, as changing this value will
reset some others back to defaults. (!!)


 net.ipv4.tcp_syncookies = 1

^- not usually worth the overhead?


 net.ipv4.conf.all.rp_filter = 1

^- note that you need to be precise with your routing metrics and such
for multi-homed with rp_filter enabled. also, this costs resources,
and if you can avoid it, do so.


 net.ipv4.conf.default.send_redirects = 1
 net.ipv4.conf.all.send_redirects = 0

^- don't know if these are too useful either. i prefer to limit ICMP
beyond this. (perhaps related to forwarding defaults above.) Ex:
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_all
echo 0 > /proc/sys/net/ipv4/conf/all/accept_redirects
echo 1 > /proc/sys/net/ipv4/icmp_ignore_bogus_error_responses





 == Did I leave anything out? ==

 Well, did I?

i'd love to see an sca6000 accelerated node.  been working with these
recently but unfortunately they're allocated for other work...
(most of the other crypto hw is going to be bus / implementation
limited to less than what a beefy 64bit modern server can provide, so
of little utility in this context.)

best regards,