Hi Charlie,

On Tue, Jun 30, 2015 at 10:24:44AM +0100, Charlie Smurthwaite wrote:
> Hi Willy,
> 
> Thanks for getting back to me.
> 
> I have not tried this with many versions. I initially observed this in 
> 1.5.9, and then upgraded to 1.5.12. So far I have only observed this in 
> my production setup and have not had an opportunity to reproduce it in a 
> lab environment.
(...)

OK, thanks for the precisions.

I'm quite puzzled when readin your trace : the contents were corrupted
exactly on a packet boundary. When the packet size is the same on the
two sides, it can be argued that this could be the result of a wrapping
buffer or something, but here you have IPv6 on the front side with an
MSS of 1288, and IPv4 on the back with an MSS of 1448.

Look at this (original.pcap, server to haproxy), the data at 0x1cc will
never appear on the other side (I've cut it in two lines for better
reading) :

17:33:43.121903 IP (tos 0x0, ttl 64, id 25273, offset 0, flags [DF], proto TCP 
(6), length 1319)
    185.22.208.61.80 > 185.22.208.17.36590: Flags [FP.], cksum 0xebec 
(correct), seq 3400580661:3400581928, ack 330395171, win 235, options 
[nop,nop,TS val 1298550837 ecr 2836623384], length 1267
        0x0000:  4500 0527 62b9 4000 4006 c09b b916 d03d  E..'b.@.@......=
...
        0x0130:  203d 0a20 2020 2020 2063 6173 6520 4163  .=.......case.Ac
        0x0140:  7469 7665 5375 7070 6f72 743a 3a58 6d6c  tiveSupport::Xml
        0x0150:  4d69 6e69 2e62 6163 6b65 6e64 2e6e 616d  Mini.backend.nam
        0x0160:  650a 2020 2020 2020 7768 656e 2027 4163  e.......when.'Ac
        0x0170:  7469 7665 5375 7070 6f72 743a 3a58 6d6c  tiveSupport::Xml
        0x0180:  4d69 6e69 5f52 4558 4d4c 273b 2020 2020  Mini_REXML';....
        0x0190:  2020 2020 5275 6e74 696d 6545 7272 6f72  ....RuntimeError
        0x01a0:  0a20 2020 2020 2077 6865 6e20 2741 6374  .......when.'Act
        0x01b0:  6976 6553 7570 706f 7274 3a3a 586d 6c4d  iveSupport::XmlM
        0x01c0:  696e 695f 4e6f 6b6f 6769 7269            ini_Nokogiri
-->     0x01c0:                                273b 2020              ';..
        0x01d0:  2020 204e 6f6b 6f67 6972 693a 3a58 4d4c  ...Nokogiri::XML
        0x01e0:  3a3a 5379 6e74 6178 4572 726f 720a 2020  ::SyntaxError...
        0x01f0:  2020 2020 7768 656e 2027 4163 7469 7665  ....when.'Active
        0x0200:  5375 7070 6f72 743a 3a58 6d6c 4d69 6e69  Support::XmlMini
        0x0210:  5f4e 6f6b 6f67 6972 6953 4158 273b 2020  _NokogiriSAX';..
        0x0220:  5275 6e74 696d 6545 7272 6f72 0a20 2020  RuntimeError....
        0x0230:  2020 2077 6865 6e20 2741 6374 6976 6553  ...when.'ActiveS
        0x0240:  7570 706f 7274 3a3a 586d 6c4d 696e 695f  upport::XmlMini_
        0x0250:  4c69 6258 4d4c 273b 2020 2020 2020 204c  LibXML';.......L
        0x0260:  6962 584d 4c3a 3a58 4d4c 3a3a 4572 726f  ibXML::XML::Erro
        0x0270:  720a 2020 2020 2020 7768 656e 2027 4163  r.......when.'Ac

And the other side :

17:33:43.333580 IP6 2a00:67a0:a:3::194.80 > 
2a00:67a0:a:6:9e:d249:4be6:5fb8.59872: Flags [P.], seq 3058168642:3058169930, 
ack 1833196557, win 90, options [nop,nop,TS val 2836623437 ecr 884585457], 
length 1288
        0x0000:  6000 0000 0528 0640 2a00 67a0 000a 0003  `....(.@*.g.....
        0x0010:  0000 0000 0000 0194 2a00 67a0 000a 0006  ........*.g.....
        0x0020:  009e d249 4be6 5fb8 0050 e9e0 b647 f342  ...IK._..P...G.B
        0x0030:  6d44 5c0d 8018 005a 669a 0000 0101 080a  mD\....Zf.......
        0x0040:  a913 704d 34b9 b3f1 2054 696d 652e 7574  ..pM4....Time.ut
(...)
        0x04c0:  6520 4163 7469 7665 5375 7070 6f72 743a  e.ActiveSupport:
        0x04d0:  3a58 6d6c 4d69 6e69 2e62 6163 6b65 6e64  :XmlMini.backend
        0x04e0:  2e6e 616d 650a 2020 2020 2020 7768 656e  .name.......when
        0x04f0:  2027 4163 7469 7665 5375 7070 6f72 743a  .'ActiveSupport:
        0x0500:  3a58 6d6c 4d69 6e69 5f52 4558 4d4c 273b  :XmlMini_REXML';
        0x0510:  2020 2020 2020 2020 5275 6e74 696d 6545  ........RuntimeE
        0x0520:  7272 6f72 0a20 2020 2020 2077 6865 6e20  rror.......when.
        0x0530:  2741 6374 6976 6553 7570 706f 7274 3a3a  'ActiveSupport::
        0x0540:  586d 6c4d 696e 695f 4e6f 6b6f 6769 7269  XmlMini_Nokogiri
-->
17:33:43.333700 IP6 2a00:67a0:a:3::194.80 > 
2a00:67a0:a:6:9e:d249:4be6:5fb8.59872: Flags [.], seq 3058169930:3058171218, 
ack 1833196557, win 90, options [nop,nop,TS val 2836623437 ecr 884585457], 
length 1288
        0x0000:  6000 0000 0528 0640 2a00 67a0 000a 0003  `....(.@*.g.....
        0x0010:  0000 0000 0000 0194 2a00 67a0 000a 0006  ........*.g.....
        0x0020:  009e d249 4be6 5fb8 0050 e9e0 b647 f84a  ...IK._..P...G.J
        0x0030:  6d44 5c0d 8010 005a 51dd 0000 0101 080a  mD\....ZQ.......
        0x0040:  a913 704d 34b9 b3f1 2020 2020 203c 2f6c  ..pM4........</l
        0x0050:  6f67 6f3e 0a20 2020 2020 203c 2f62 6c6f  ogo>.......</blo
        0x0060:  673e 0a20 2020 2058 4d4c 0a20 2020 2066  g>.....XML.....f
        0x0070:  696c 6520 3d20 4861 7368 2e66 726f 6d5f  ile.=.Hash.from_
        0x0080:  786d 6c28 626c 6f67 5f78 6d6c 295b 2762  xml(blog_xml)['b
        0x0090:  6c6f 6727 5d5b 276c 6f67 6f27 5d0a 2020  log']['logo']...
        0x00a0:  2020 6173 7365 7274 5f65 7175 616c 2027  ..assert_equal.'
        0x00b0:  756e 7469 746c 6564 272c 2066 696c 652e  untitled',.file.
        0x00c0:  6f72 6967 696e 616c 5f66 696c 656e 616d  original_filenam
        0x00d0:  650a 2020 2020 6173 7365 7274 5f65 7175  e.....assert_equ
        0x00e0:  616c 2027 6170 706c 6963 6174 696f 6e2f  al.'application/
        0x00f0:  6f63 7465 742d 7374 7265 616d 272c 2066  octet-stream',.f

==> totally unrelated contents!

What causes a problem to me is that haproxy doesn't decide how/where the
packets are segmented, the TCP stack does it. This does not mean haproxy
is not faulty, but that it had 1288 valid positions to cut these data and
that by pure luck it cut them on the only position which is exactly between
two packets, which looks suspicious. I remember we faced a very similar
problem a very long time ago with TCP splicing due to a kernel bug, and it
was triggered by retransmits (which also happen in your case). However I'm
seeing "nosplice" in your config, so that puzzles me even more.

Would you have the ability to test it over IPv4 only on the client side,
so that we see if the issue happens when mixing v4 with v6 ?

If you could run a few tests in TCP mode instead of HTTP, it would provide
a significant help as well.

Another interesting test to run would be to have haproxy running through
strace during a faulty test. You'll need to do it like this. The log will
be large:

     strace -o log -tts 16384 -p $(haproxy_pid)

The trace will be huge but it will tell us what haproxy sends down to the
stack. And if we see the corruption there, we'll see if it happens in the
middle of a send() or at the beginning immediately after another one,
indicating a wrapping buffer.

Just as a quick verification, since I saw your kernel is dated December,
I checked the fixes merged since then in the 3.13.y-ckt branch and found
nothing apparently relevant.

Thanks!
Willy


Reply via email to