On Feb 14, 2011, at 10:37 AM, Robert Olson wrote:
>
>> Hello Robert,
>>
>> On 14/02/2011, at 16:36, Robert Olson wrote:
>>
>>> The problem we're seeing is that for a particular test script on the
>>> client side, one of the exchanges is failing.
>>>
>>> Looking at packet traces, I see the client sending a complete
>>> request to cherokee. Cherokee sends the request to the compute
>>> server, but it appears to be truncating the request one packet shy
>>> of finishing it. The compute server then reports a bad parse in
>>> response.
>>>
>>> The problem initially showed up fairly reliably only when both front
>>> ends were running. If I killed wackamole on one of them (pushing
>>> both IPs over to a single server) the problem vanished.
>>
>> Did it perform a clean 'three way' close sequence (FIN, FIN+ACK, ACK)?
>> The last packet might be lost if an RST were sent while the connection
>> is being closed.
>>
>> Cheers!
>
> Here's the last bit of one of the failed exchanges. ml-mds is the frontend,
> oak is the compute server. It sure looks like the frontend just decided to
> close down the connection; its last packet appears to be the FIN. I
> unfortunately don't appear to have saved any subsequent packets. I'll be
> trying today to replicate the problem again and get strace output on the
> frontends as well as detailed packet traces all around.
>
> Thanks,
No joy yet on getting the problem repeated, but on looking at the traces for
the successful runs, the initial FIN was sent by the compute server when it
finished writing its output; the client (cherokee) only sent its FIN in
response. We've had some flaky behavior with the network switch that these
systems use, so I'm not going to rule out hardware issues (though it seems odd
that a hardware failure would trigger an early FIN).
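To compare traces without reading them by eye, here is a minimal sketch that reports which endpoint sent the first FIN. It assumes the older tcpdump text format used in this thread (flags printed right after the destination, e.g. "F" or "."), with each packet header joined onto a single line; the function name first_fin_sender is made up for illustration, and newer tcpdump versions that print "Flags [F.]" would need a different pattern.

```python
import re

# Header line of old-style tcpdump text output, assumed to be on one line:
# "<time> IP <src> > <dst>: <flags> ..."
HEADER = re.compile(r"^(\d\d:\d\d:\d\d\.\d+) IP (\S+) > (\S+): (\S+)")

def first_fin_sender(lines):
    """Return the source (host.port) of the first packet with FIN set, or None."""
    for line in lines:
        m = HEADER.match(line)
        # Hex dump rows do not match HEADER and are skipped.
        if m and "F" in m.group(4):
            return m.group(2)
    return None

# Header lines copied from the failed-exchange trace in this thread.
trace = [
    "10:48:00.188691 IP oak.mcs.anl.gov.5104 > ml-mds.mcs.anl.gov.42694: . ack 15521 win 288",
    "10:48:17.186797 IP ml-mds.mcs.anl.gov.42694 > oak.mcs.anl.gov.5104: F 15521:15521(0) ack 1 win 46",
]
print(first_fin_sender(trace))
```

On the successful runs the same check should name the compute server instead of the frontend.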
Aha. I think the key is here:
10:48:00.188691 IP oak.mcs.anl.gov.5104 > ml-mds.mcs.anl.gov.42694: . ack 15521 win 288 <nop,nop,timestamp 1364352976 1359990193>
    0x0000: 4500 0034 68ae 4000 4006 c140 c005 c860  E..4h.@.@..@...`
    0x0010: c005 c869 13f0 a6c6 7649 376b 6cd7 4707  ...i....vI7kl.G.
    0x0020: 8010 0120 789a 0000 0101 080a 5152 5fd0  ....x.......QR_.
    0x0030: 510f cdb1                                Q...
10:48:17.186797 IP ml-mds.mcs.anl.gov.42694 > oak.mcs.anl.gov.5104: F 15521:15521(0) ack 1 win 46 <nop,nop,timestamp 1360007195 1364352976>
    0x0000: 4500 0034 c7e4 4000 4006 620a c005 c869  E..4..@.@.b....i
    0x0010: c005 c860 a6c6 13f0 6cd7 4707 7649 376b  ...`....l.G.vI7k
    0x0020: 8011 002e 3721 0000 0101 080a 5110 101b  ....7!......Q...
    0x0030: 5152 5fd0                                QR_.
There are about 17 seconds between those packets. The cherokee timeout had been
set to the default at this point. I bet oak was waiting for the rest of the
request, ml-mds wasn't sending it, and cherokee timed out and closed the
connection. I think this is consistent with dropped packets.
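As a sanity check on that gap, a small sketch that subtracts the two tcpdump timestamps from the trace above. The helper name gap_seconds is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

def gap_seconds(first: str, second: str) -> float:
    """Seconds between two tcpdump-style HH:MM:SS.ffffff timestamps (same day)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()

# Timestamps of oak's last ACK and the frontend's FIN.
gap = gap_seconds("10:48:00.188691", "10:48:17.186797")
print(f"{gap:.3f} s")
```

If the configured timeout is at or below this gap, a timeout-driven close would fit the trace; the actual default value is not stated in this thread.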
--bob
_______________________________________________
Cherokee mailing list
http://lists.octality.com/listinfo/cherokee