Hi Guillaume,

On Fri, Mar 12, 2010 at 03:07:52PM +0100, Guillaume Castagnino wrote:
> Hi,
> 
> Here is my setup :
> - 2 debian lenny nodes, with haproxy 1.3.22 (lenny backport package)
> - kernel 2.6.33 with last grsecurity patch 
> (grsecurity-2.1.14-2.6.33-201003071645.patch)
> - postfix 2.5.5
> 
> haproxy runs on one of the two nodes (which one is controlled by heartbeat), 
> and uses the 2 nodes as backends for HTTP, HTTPS and SMTP.
> 
> No problem with HTTP and HTTPS backends.
> 
> 
> But I have a problem with the SMTP backends when enabling the BLACKHOLE 
> grsecurity feature. I spent some time with Brad Spengler (grsec dev) trying 
> to fix this within grsec. We tried some patches, but found nothing. There 
> seems to be a missing RST packet when closing the connection, and for now he 
> has found no way to fix it without disabling the BLACKHOLE feature.

I've just looked at your traces. It's strange that it's related to the
blackhole feature because the doc says it just disables sending of
port unreachables (and possibly RSTs). From your traces, an RST is
properly sent in response to the "250", but the server happily
ignores it despite the fact that its sequence number is OK, and it
keeps resending the same data over and over. And as your trace
shows that you sniffed on the server, there's no risk that the
RST was dropped on the network.

> Last thought was it could be a problem/bug within haproxy

Haproxy could probably help hide this issue, but the issue is clearly
below haproxy as it has no control over packets, sequence numbers, acks,
etc., which are purely TCP. Those are only processed by the OS. BTW, when
the RST is sent, it is because haproxy has already closed the socket and
does not own it anymore.

> Symptoms :
> - each SMTP probe (smtpchk) results in a socket in the LAST_ACK state on the 
> remote backend (the local backend is not affected since BLACKHOLE does not 
> affect local sockets).
> - lots of TCP retransmits from the SMTP backend.
> - lots of SMTP probes fail, probably due to the large number of sockets 
> remaining in the LAST_ACK state. From my stats, around 6% of the probes 
> fail.

This issue uncovers a more concerning one. If your server remains in
LAST_ACK for a long time, it may mean that its timeouts are not applied
in this precise state, and this problem merely happened to reveal it.
If this is the case, it would mean it's easy to DoS it by just simulating
the same behaviour :-/
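
If you want to watch how many sockets pile up in LAST_ACK on the backend,
something like the following could help. It is only a minimal sketch,
assuming a Linux box where /proc/net/tcp is readable; the function name
count_tcp_states is made up for illustration, and the state codes are the
ones the Linux kernel uses in the "st" column.

```python
from collections import Counter

# Hex state codes as they appear in the "st" column of /proc/net/tcp (Linux).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from /proc/net/tcp-style text.

    Hypothetical helper: it only looks at the fourth column ("st").
    """
    counts = Counter()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "12:"; this skips
        # the header row and any blank or truncated line.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts
```

Feeding it the contents of /proc/net/tcp every few seconds and printing
the "LAST_ACK" entry should show whether those sockets ever expire or
just keep accumulating.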

Your captures were very useful. I'll contact Brad about that. Maybe if
he explains to me how his patch works, it will help him find how to fix it.
Do you mind if I CC you?

Regards,
Willy
