On Tue, Oct 10, 2023 at 03:57:09PM +0200, Willy Tarreau wrote:
> On Tue, Oct 10, 2023 at 03:49:21PM +0200, Willy Tarreau wrote:
> > > Seems like a clever update to the "good old" h2 multiplexing abuse
> > > vectors:
> > > 1. Client opens a lot of H2 streams on a connection
> > > 2. Spams some requests
> > > 3. Immediately sends H2 RST frames for all of them
> > > 4. Goes back to 1 and repeats.
> >
> > Yes, precisely one of those I tested last week among the usual approaches
> > consisting of creating tons of streams while staying within the protocol's
> > validity limits. The only thing it did was to detect the pool issue I
> > mentioned in the dev7 announce.
> >
> > > The idea being to cause resource exhaustion on the server/proxy, at least
> > > when it allocates stream-related buffers etc., and on the underlying
> > > server too, since it likely sees the requests before they get cancelled.
> > >
> > > Looking at HAProxy I'd like to know if someone's aware of a decent
> > > mitigation option?
> >
> > We first need to check whether we're affected at all, since we keep a count
> > of the attached streams for precisely this case and we refrain from
> > processing HEADERS frames when we have too many streams, so normally
> > the mux will pause, waiting for the upper layers to close.
>
> So at first glance we indeed addressed this case in 2018 (1.9-dev)
> with this commit:
>
> f210191dc ("BUG/MEDIUM: h2: don't accept new streams if conn_streams are
> still in excess")
>
> It was incomplete back then and later refined, but the idea is there.
> I'll try to stress that area again to see.
Since I was getting bored wasting my time trying to harm the process from
scripts, I finally wrote a small program that does the same but much faster.
For now the results are just as boring. In short:
- the concurrency issue was addressed 5 years ago with the commit
  above, so all maintained versions are immune to this. The principle
  is that each H2 connection knows both the number of protocol-level
  streams attached to it and the number of application-level streams,
  and it's the latter that enforces the limit, preventing the mux from
  processing more requests until the number of active streams is within
  the limit again. In the worst case (i.e. if the attacker downloads and
  its window updates cannot enter anymore), the streams will simply time
  out, then the connection, just like on a single non-multiplexed
  connection, so nothing new here. It also means that no more than the
  configured limit of streams per connection will reach the hosted
  application at once (see the illustrative sketch after this list).
- the risk of CPU usage that was also mentioned is not really relevant
  either. These aborted requests actually cost less CPU than completed
  ones, and on my laptop I found that I would reach up to 100-150k req/s
  per core (depending on CPU thermal throttling), which is perfectly
  within what we normally observe with a standard h2load injection.
  With fewer streams I could even reach 1 million requests per second
  in total, because they were aborted before being turned into a regular
  stream, so the load was essentially between haproxy and the client.
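To make the bounds described in the first point above concrete, here is a
minimal, purely illustrative sketch of the standard knobs involved (these
directives exist in the configuration manual; the values below are just the
defaults or arbitrary examples, not a recommendation specific to this attack):

  global
      # advertised SETTINGS_MAX_CONCURRENT_STREAMS; 100 is already the default
      tune.h2.max-concurrent-streams 100

  defaults
      timeout http-request 10s   # a request's headers must arrive within this delay
      timeout client       30s   # stalled client-side streams eventually expire
      timeout server       30s   # same on the server side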
So at this point I'm still failing to find any case where this attack
hurts haproxy more than any of the benchmarks we routinely inflict on
it, given that it acts exactly like a client configured with a short
timeout (e.g. if you configure haproxy with "timeout server 1" and
have an h2 server, you will basically get the same traffic pattern).
If you want to block some annoying testers who would fill your logs
in the next few days, I checked that the following works fine here
(set the limit to the maximum number of requests per 10 seconds you
want to accept, e.g. 1000 below, i.e. 100/s, which keeps a large margin):
tcp-request connection track-sc0 src
http-request reject if { sc0_http_req_rate gt 1000 }
stick-table type ip size 1m store http_req_rate(10s)
It's even possible to play with "set-log-level silent" to avoid logging
them, and you may even block new connections at the TCP layer. But for
now, if your site requires any of this, I can't see how it has not already
experienced weekly outages from standard attacks. Note that when I say
that your server being able to process 2 million req/s doesn't mean you
have to run it at that speed on a single machine, it's precisely to
preserve that type of comfortable margin.
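Purely as an illustration of those two extra steps combined with the rules
above (the frontend name, bind address and certificate path are made up,
and the threshold is the same as before):

  frontend fe_https
      bind :443 ssl crt /etc/haproxy/site.pem alpn h2,http/1.1
      stick-table type ip size 1m store http_req_rate(10s)
      tcp-request connection track-sc0 src
      # drop new connections from sources already way above the limit
      tcp-request connection reject if { sc0_http_req_rate gt 1000 }
      # silence the log line before rejecting the request
      http-request set-log-level silent if { sc0_http_req_rate gt 1000 }
      http-request reject if { sc0_http_req_rate gt 1000 }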
I tested the client on an EPYC 74F3 server (24 cores at 3 GHz, the one
I demoed at haproxyconf): haproxy handles 800k req/s at saturation,
spending most of its time in the weighted round-robin lock (it reaches
850k with random, so pretty much standard for this machine), and perf
top looks good:
Samples: 9M of event 'cycles', 4000 Hz, Event count (approx.): 864711451063 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
4.11% haproxy [.] process_stream
3.29% haproxy [.] srv_add_to_idle_list
3.02% haproxy [.] conn_backend_get.isra.0
2.61% haproxy [.] back_try_conn_req
2.46% haproxy [.] stream_set_backend
1.99% haproxy [.] chash_get_server_hash
1.75% haproxy [.] assign_server
1.67% [kernel] [k] fetch_pte.isra.0
1.67% [kernel] [k] ice_napi_poll
1.44% haproxy [.] sess_change_server
1.39% haproxy [.] stream_update_time_stats
1.35% [kernel] [k] iommu_map_page
1.31% haproxy [.] http_wait_for_request
1.24% [kernel] [k] acpi_processor_ffh_cstate_enter
1.23% haproxy [.] __pool_alloc
1.22% [kernel] [k] skb_release_data
1.17% [kernel] [k] tcp_ack
1.13% [kernel] [k] tcp_recvmsg
1.09% [kernel] [k] copy_user_generic_string
0.99% [kernel] [k] __fget_light
I also compared the number of calls to the different functions inside
the process under attack and under h2load. They're pretty much identical:
Attack (24 clients, stopped at ~2.2M req):
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on    # set profiling tasks {on|auto|off}
Memory usage profiling : off   # set profiling memory {on|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
process_stream 2258751 14.59s 6.457us 46.20m 1.227ms
<- sc_notify@src/stconn.c:1141 task_wakeup
sc_conn_io_cb 2257588 866.4ms 383.0ns 7.043m 187.2us
<- sc_app_chk_rcv_conn@src/stconn.c:780 tasklet_wakeup
process_stream 2253237 31.33s 13.90us 47.91m 1.276ms
<- stream_new@src/stream.c:578 task_wakeup
h1_io_cb 2250867 6.203s 2.755us 43.14s 19.16us
<- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
sc_conn_io_cb 1436248 1.087s 756.0ns 49.17s 34.23us
<- h1_wake_stream_for_recv@src/mux_h1.c:2578 tasklet_wakeup
h1_io_cb 247901 69.24ms 279.0ns 31.15s 125.7us
<- h1_takeover@src/mux_h1.c:4147 tasklet_wakeup
h2_io_cb 200187 4.025s 20.11us 27.02s 135.0us
<- h2c_restart_reading@src/mux_h2.c:737 tasklet_wakeup
sc_conn_io_cb 161965 124.1ms 766.0ns 20.77s 128.2us
<- h2s_notify_recv@src/mux_h2.c:1240 tasklet_wakeup
h2_io_cb 70176 1.483s 21.14us 9.021s 128.5us
<- h2_snd_buf@src/mux_h2.c:6675 tasklet_wakeup
h1_io_cb 6126 8.775ms 1.432us 16.69s 2.725ms
<- sock_conn_iocb@src/sock.c:854 tasklet_wakeup
sc_conn_io_cb 6115 46.96ms 7.679us 22.63s 3.700ms
<- h1_wake_stream_for_send@src/mux_h1.c:2588 tasklet_wakeup
h1_timeout_task 4316 832.9us 192.0ns 4.955s 1.148ms
<- h1_release@src/mux_h1.c:1045 task_wakeup
accept_queue_process 12 133.6us 11.13us 1.449ms 120.8us
<- listener_accept@src/listener.c:1469 tasklet_wakeup
h1_io_cb 11 3.006ms 273.3us 4.930us 448.0ns
<- conn_subscribe@src/connection.c:736 tasklet_wakeup
h2_io_cb 6 405.4us 67.56us 4.555ms 759.2us
<- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
task_run_applet 1 - - 233.9us 233.9us
<- sc_applet_create@src/stconn.c:502 appctx_wakeup
sc_conn_io_cb 1 11.86us 11.86us 1.373us 1.373us
<- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
task_run_applet 1 411.0ns 411.0ns 5.311us 5.311us
<- sc_app_shut_applet@src/stconn.c:911 appctx_wakeup
h2load with 24 clients and approximately the same number of requests:
$ socat - /tmp/sock1 <<< "show profiling"
Per-task CPU profiling : on    # set profiling tasks {on|auto|off}
Memory usage profiling : off   # set profiling memory {on|off}
Tasks activity:
function calls cpu_tot cpu_avg lat_tot lat_avg
process_stream 2261040 13.42s 5.933us 34.49m 915.2us
<- sc_notify@src/stconn.c:1141 task_wakeup
sc_conn_io_cb 2259306 657.0ms 290.0ns 5.251m 139.4us
<- sc_app_chk_rcv_conn@src/stconn.c:780 tasklet_wakeup
h1_io_cb 2258770 4.721s 2.090us 30.58s 13.54us
<- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
process_stream 2258252 29.02s 12.85us 34.76m 923.5us
<- stream_new@src/stream.c:578 task_wakeup
sc_conn_io_cb 1690412 1.102s 652.0ns 59.93s 35.45us
<- h1_wake_stream_for_recv@src/mux_h1.c:2578 tasklet_wakeup
h2_io_cb 183444 2.820s 15.37us 28.68s 156.3us
<- h2_snd_buf@src/mux_h2.c:6675 tasklet_wakeup
h2_io_cb 149856 203.4ms 1.357us 496.5ms 3.313us
<- h2c_restart_reading@src/mux_h2.c:737 tasklet_wakeup
h2_io_cb 72164 2.100s 29.10us 7.894s 109.4us
<- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
h1_io_cb 4805 1.579ms 328.0ns 573.0ms 119.3us
<- h1_takeover@src/mux_h1.c:4147 tasklet_wakeup
sc_conn_io_cb 2288 15.37ms 6.717us 24.89s 10.88ms
<- h1_wake_stream_for_send@src/mux_h1.c:2588 tasklet_wakeup
h1_io_cb 2288 2.436ms 1.064us 20.22s 8.838ms
<- sock_conn_iocb@src/sock.c:854 tasklet_wakeup
h1_timeout_task 222 95.21us 428.0ns 750.5ms 3.381ms
<- h1_release@src/mux_h1.c:1045 task_wakeup
h2_timeout_task 24 7.914us 329.0ns 314.9us 13.12us
<- h2_release@src/mux_h2.c:1146 task_wakeup
accept_queue_process 9 75.00us 8.333us 416.8us 46.31us
<- listener_accept@src/listener.c:1469 tasklet_wakeup
h1_io_cb 2 124.6us 62.28us 1.785us 892.0ns
<- conn_subscribe@src/connection.c:736 tasklet_wakeup
task_run_applet 1 431.0ns 431.0ns 5.631us 5.631us
<- sc_app_shut_applet@src/stconn.c:911 appctx_wakeup
task_run_applet 1 - - 1.563us 1.563us
<- sc_applet_create@src/stconn.c:502 appctx_wakeup
sc_conn_io_cb 1 13.80us 13.80us 1.864us 1.864us
<- sock_conn_iocb@src/sock.c:875 tasklet_wakeup
I can't differentiate them; most of the activity is at the application
layer (process_stream).
As Tristan mentioned, lowering tune.h2.be.max-concurrent-streams may
also slow them down, but it will also slow down some sites with many
objects (think shops with many images). For a long time the high
parallelism of H2 was sold as a huge differentiator; I don't feel like
starting to advertise lowering it now.
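For reference only, since as said above I'm not going to recommend it as a
general measure, lowering it is a single line in the global section (the
value here is arbitrary):

  global
      tune.h2.be.max-concurrent-streams 20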
Also, please note that when it comes to anything past the reverse proxy,
there's no difference between the attack over a single connection and
sending 10 times less traffic per connection over 10 connections (i.e.
h2load): the total number of streams remains the same. So in any case,
any remediation based on lowering the number of streams per connection
just invites the client to increase the number of connections.
Now of course, if some people find corner cases that affect them, I'm all
ears (and we can even discuss them privately if really needed). But I think
that this issue essentially depends on each component's architecture: some
will eat more CPU, others more RAM, etc.
There are lots of other interesting attacks on the H2 protocol that can
be triggered with just a regular client: low timeouts, low stream windows
(use h2load -w 1 to have fun), zero windows during transfers, and even
one-byte CONTINUATION frames that may force some components to perform
reallocations and copies. But most of them depend on the implementation
and on the attacker, and were discussed at great length during the
protocol design 10 years ago so that the cost remains balanced between
the attacker and the target. In the end, H2 is not very robust, but each
implementation has certain possibilities to cover some of its limitations,
and these differ due to many architectural constraints.
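For those who want to experiment with the small-window case mentioned above,
an invocation along these lines reproduces it (the URL and counts are
arbitrary; h2load's -w option sets the per-stream initial window to 2^N-1
bytes, so -w 1 means a 1-byte window):

  $ h2load -n 100000 -c 10 -m 100 -w 1 https://127.0.0.1:8443/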
The good point in all this is that it will probably make more people want
to reconsider H3/QUIC if they don't trust their products anymore :-)
Willy