On 02/06/2012 11:35 AM, Gregory Farnum wrote:
On Mon, Feb 6, 2012 at 10:20 AM, Jim Schutt<[email protected]> wrote:On 02/06/2012 10:22 AM, Yehuda Sadeh Weinraub wrote:On Mon, Feb 6, 2012 at 8:20 AM, Jim Schutt<[email protected]> wrote:The above suggests to me that the slowdown is a result of requests not getting submitted at the same rate as when things are running well.Yeah, it really looks like that. My suggestions wouldn't help there. I do see that when things go well the number of writes per device is capped at ~200 writes per second and the throughput per device is ~100MB/sec. Is 100MB/sec the expected device throughput?Pretty much, at least for the outer tracks on a drive. I've seen ~108 MB/s with dd to a block device. Also, I've got 8 drives per SAS adapter with 6 Gb/s links, so it seems unlikely to me that my disk subsystem is any sort of significant bottleneck.Well, you might try changing your throttling settings on the OSDs. ms_dispatch_throttle_bytes defaults to 100<<20 (100MB) and is used for throttling dispatch; osd_max_client_bytes defaults to 500<<20 (500MB) and is used to limit the amount of client data in memory (ie; messages are included in this throttler for their entire lifetime, not just while waiting for dispatch).
I've made a little progress isolating this.
"osd client message size cap = 5000000" makes the stall
completely reproducible (which also means I can reproduce
on two different network types, ethernet and IPoIB.), and I
am able to generate graphs of throttled/receive/process time
for each request received by an OSD (see attached SVG plot).
Such plots suggest to me my problem is caused by stalled
receives. Using debug ms = 30 on my OSDs turns up instances
of this:
osd.0.log:4514502:2012-02-08 12:34:39.258276 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader
wants 4194432 from dispatch throttler 0/25000000
osd.0.log:4514503:2012-02-08 12:34:39.258298 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader got
front 128
osd.0.log:4514504:2012-02-08 12:34:39.258325 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader
allocating new rx buffer at offset 0
osd.0.log:4514507:2012-02-08 12:34:39.258423 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader
reading nonblocking into 0x1656c000 len 4194304
osd.0.log:4514509:2012-02-08 12:34:39.259060 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read
1369231 of 4194304
osd.0.log:4546819:2012-02-08 12:35:35.468156 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader
reading nonblocking into 0x166ba48f len 2825073
osd.0.log:4546820:2012-02-08 12:35:35.468189 7f6acec77700 -- 172.17.131.32:6800/15199
>> 172.17.135.85:0/2712733083 pipe(0x2ef0000 sd=173 pgs=7 cs=1 l=1).reader read
1448 of 2825073
which I take to mean that the reader thread sat in poll() for 56 secs, in
this case.
I was able to correlate such stalls with tcpdump output collected on
clients. Here's an example from another run:
15:09:37.584600 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
23631561, win 65535, options [nop,nop,TS val 1096144 ecr 1100575], length 0
15:09:37.584613 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23631561:23663417, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
1096144], length 31856
15:09:37.584655 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23663417:23695273, ack 1218, win 20904, options [nop,nop,TS val 1100615 ecr
1096144], length 31856
15:09:37.624476 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
23695273, win 65535, options [nop,nop,TS val 1096184 ecr 1100615], length 0
15:09:37.624489 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23695273:23727129, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
1096184], length 31856
15:09:37.624532 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
23727129:23758985, ack 1218, win 20904, options [nop,nop,TS val 1100655 ecr
1096184], length 31856
15:09:37.664454 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
23758985, win 65535, options [nop,nop,TS val 1096224 ecr 1100655], length 0
15:09:37.664468 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23758985:23790841, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
1096224], length 31856
15:09:37.664506 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23790841:23822697, ack 1218, win 20904, options [nop,nop,TS val 1100695 ecr
1096224], length 31856
15:09:37.706937 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
23822697, win 65535, options [nop,nop,TS val 1096266 ecr 1100695], length 0
15:09:37.706950 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23854553, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
1096266], length 31856
15:09:37.706995 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [P.], seq
23854553:23886409, ack 1218, win 20904, options [nop,nop,TS val 1100738 ecr
1096266], length 31856
15:09:37.929946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1100961 ecr
1096266], length 1448
15:09:38.376961 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1101408 ecr
1096266], length 1448
15:09:39.270947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1102302 ecr
1096266], length 1448
15:09:41.056943 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1104088 ecr
1096266], length 1448
15:09:44.632946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1107664 ecr
1096266], length 1448
15:09:51.784947 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1114816 ecr
1096266], length 1448
15:10:06.088945 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1129120 ecr
1096266], length 1448
15:10:34.728951 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1157760 ecr
1096266], length 1448
15:11:31.944946 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23822697:23824145, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
1096266], length 1448
15:11:31.945075 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976], length 0
15:11:31.945091 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23886409:23889305, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
1210496], length 2896
15:11:31.945178 IP 172.17.131.32.6808 > 172.17.135.7.37045: Flags [.], ack
23824145, win 65535, options [nop,nop,TS val 1210496 ecr 1214976,nop,nop,sack 1
{23886409:23887857}], length 0
15:11:31.945199 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23824145:23825593, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
1210496], length 1448
15:11:31.945207 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23825593:23827041, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
1210496], length 1448
15:11:31.945214 IP 172.17.135.7.37045 > 172.17.131.32.6808: Flags [.], seq
23827041:23828489, ack 1218, win 20904, options [nop,nop,TS val 1214976 ecr
1210496], length 1448
So in this case the client retransmitted for ~2 minutes with no response from
the OSD. Note that during this time the client was talking to other OSDs on the
same server.
I want to try turning off GSO/GRO on my interfaces, but then
I think I need to post to netdev...
-- Jim
osd.0.msg-et.svg.bz2
Description: application/bzip
