I finally managed to track down the issue, and the cause was much
simpler than I had thought.
As I've mentioned before, the service exposed through this HAProxy
instance is mainly accessed by mobile devices. The errors appeared when
apps were closed (either manually or because of a crash) while an HTTPS
connection was being established (we're doing a final API call when the
app is being closed, for example). I've managed to replicate this
behavior reliably.
This also triggered some BADREQ errors when the SSL connection was
established but no data was ever sent.
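
The two cases described above can also be triggered from a script
instead of a phone. This is only a minimal sketch using Python's
standard library: the function names are hypothetical, and which log
message each call produces on a given HAProxy version is an assumption
to check against your own logs.

```python
import socket
import ssl


def close_before_handshake(host: str, port: int, timeout: float = 5.0) -> None:
    """Connect over TCP, then close without sending a ClientHello.

    Mimics an app that is killed just as it starts an HTTPS connection.
    """
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.close()


def close_after_handshake(host: str, port: int, timeout: float = 5.0) -> None:
    """Complete the TLS handshake, then close without sending HTTP data.

    Mimics an app that is killed after the SSL connection is established
    but before the request is written.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # wrap_socket performs the handshake; nothing is sent afterwards.
        with context.wrap_socket(raw, server_hostname=host):
            pass
```

Pointing both functions at the load balancer's port 443 and watching the
HAProxy log should show whether the handshake errors and the BADREQ
entries follow.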
The reason we didn't detect this earlier is that AWS ELB didn't offer
any logging, and the CloudWatch metrics were only for HTTP return
statuses (2XX, 4XX & 5XX). Of course, these cases didn't trigger any of
those.
Thanks again to everyone for your support!
Best,
Andrei
On 07/08/2013 11:06 AM, Andrei Marinescu wrote:
Hi Lukas,
Unfortunately I'm not able to reproduce this on any of the devices I
have access to. I'm just seeing these errors in the logs and I'm trying
to track down the issue. I guess I'll try to find an easy-to-reproduce
scenario and return with a cap file at that time.
Just so that I can cross one possibility off my list: is it possible
that some devices reject the certificate I'm using? I'm thinking of this
because I ran into an issue with this CA on another server (a payment
gateway wouldn't connect over HTTPS, and the problem was solved by
changing the cert). 99% of the devices connecting to this endpoint are
Android and iOS devices, and given the fragmentation that Android
suffers from, this wouldn't surprise me.
Thanks everyone!
Best,
Andrei
Lukas Tribus
<mailto:[email protected]>
July 8, 2013 11:46 AM
Hi Andrei,
I only see a single session of that IP in the cap file.
What we can see from the dump is:
- the client provides both a TLS session ticket and a session ID
- the server acknowledges the session ID
- the server sends a "Change Cipher Spec" message [1]
- the client disconnects
I don't think this is enough information to draw a conclusion. A wild
guess could be that the client gets upset about the Change Cipher Spec
message, but that is really a very wild guess.
We would need to see the sessions before and after this one to be able
to put them in context. Any additional information about the User-Agent
would certainly also help.
Btw, can you clearly reproduce this, or is this a random session that
failed on your production box?
Regards,
Lukas
[1]
http://de.wikipedia.org/wiki/Transport_Layer_Security#TLS_Change_Cipher_Spec_Protocol
Willy Tarreau <mailto:[email protected]>
July 8, 2013 9:40 AM
Hello Andrei,
That would definitely help, in order to pass it via ssldump. Or you can
do it yourself as well. What I'm seeing anyway (-q wasn't the most
helpful option here :-)) is that the client closes first. The sequence
looks like this:
client SYN server
port 58713 -----------------------> :443
SYN/ACK
<-----------------------
ACK
----------------------->
PSH: TLSv1 client hello with SNI
----------------------->
PSH: TLSv1 server hello
<-----------------------
FIN: client decides to close
----------------------->
FIN: server acknowledges and closes
<-----------------------
RST: client had already closed
----------------------->
So in short, the client disagrees with what the server proposed. Either
it's because of the algorithms in use, or because something is missing.
For example, I'm not seeing any certificate presented by the server, so
it looks like session resumption.
Ssldump would tell us what algorithms were negotiated in each direction.
You can also try with tshark/wireshark I think.
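
To illustrate the "no certificate, so probably resumption" reasoning: in
a full handshake the server follows its ServerHello with a Certificate
message, while a resumed session goes straight to ChangeCipherSpec. The
sketch below labels the TLS records in one direction of a reassembled
stream. It is not a replacement for ssldump or wireshark, the function
name is hypothetical, and it only labels the first handshake message in
each record.

```python
import struct

RECORD_TYPES = {20: "ChangeCipherSpec", 21: "Alert",
                22: "Handshake", 23: "ApplicationData"}
HANDSHAKE_TYPES = {1: "ClientHello", 2: "ServerHello",
                   4: "NewSessionTicket", 11: "Certificate",
                   12: "ServerKeyExchange", 14: "ServerHelloDone",
                   16: "ClientKeyExchange", 20: "Finished"}


def summarize_tls_records(stream: bytes) -> list[str]:
    """Label each TLS record in one direction of a TCP stream."""
    labels = []
    offset = 0
    encrypted = False  # records after ChangeCipherSpec are encrypted
    while offset + 5 <= len(stream):
        # Record header: content type (1), version (2), length (2).
        content_type, _version, length = struct.unpack_from(
            "!BHH", stream, offset)
        body = stream[offset + 5:offset + 5 + length]
        name = RECORD_TYPES.get(content_type, f"Unknown({content_type})")
        if content_type == 22 and encrypted:
            name += ":(encrypted)"
        elif content_type == 22 and body:
            name += ":" + HANDSHAKE_TYPES.get(body[0], f"type {body[0]}")
        if content_type == 20:
            encrypted = True
        labels.append(name)
        offset += 5 + length
    return labels
```

If the server side of a capture yields "Handshake:ServerHello" followed
directly by "ChangeCipherSpec", with no "Handshake:Certificate" in
between, that is consistent with session resumption.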
Best regards,
Willy
Andrei Marinescu <mailto:[email protected]>
July 8, 2013 9:16 AM
Hello Willy,
Thank you for your answer! I've attached a dump with two requests from
the same IP. The first one failed with "Connection closed during SSL
handshake", the second one with "Timeout during SSL handshake".
I've translated the .cap file with
"tcpdump -qns 0 -X -r file.cap > translated.cap"
in order to make the dump readable and extract the two requests. If the
original dump is needed, let me know and I'll attach it a.s.a.p.
Willy Tarreau <mailto:[email protected]>
July 7, 2013 10:02 PM
Hello Andrei,
It's very hard to suggest anything unfortunately, since most SSL/TLS
errors can be very cryptic. It would be nice if you could take a pcap
capture of one such faulty connection so that we can see the whole
handshake and try to find what the issue is. Many things can be
involved, including versions, algorithms, key sizes, etc...
In order to take this capture, please use
"tcpdump -s0 -npi eth0 -w file.cap"
to ensure that packets are not truncated. If you'd prefer not to reveal
your public IP address on the list, then please send me the capture in
private. But I must say that people here on the list tend to read SSL
traces faster than me :-)
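
As a quick check that a capture really was taken without truncation,
each record in a classic pcap file stores both the captured length and
the original length of the packet. The sketch below counts records where
the two differ. The function name is hypothetical and the newer pcapng
format is not handled.

```python
import struct

# Magic number as it appears on disk, mapped to the struct byte order.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def count_truncated_packets(data: bytes) -> tuple[int, int]:
    """Return (total_packets, truncated_packets) for a classic pcap file."""
    endian = PCAP_MAGICS.get(data[:4])
    if endian is None:
        raise ValueError("not a classic pcap file (pcapng is not handled)")
    offset = 24  # size of the global file header
    total = truncated = 0
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        total += 1
        if incl_len < orig_len:
            truncated += 1
        offset += 16 + incl_len
    return total, truncated
```

For example, count_truncated_packets(open("file.cap", "rb").read())
should report zero truncated packets for a capture taken with -s0.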
Regards,
Willy
Andrei Marinescu <mailto:[email protected]>
July 7, 2013 6:08 PM
Hello everyone!
I've moved off AWS ELB today to HAProxy 1.5dev18. I'm doing SSL
termination at the LB and I'm encountering a rather large number of
messages such as:
- SSL Handshake failure
- Timeout during SSL handshake
- Connection closed during SSL handshake
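
To get a feel for how often each of these messages occurs, the log can
be tallied with a few lines of Python. This is only a sketch: it matches
the three message texts listed above, ignoring case, and assumes nothing
else about the log line format. The function name is hypothetical.

```python
import re
from collections import Counter

HANDSHAKE_ERRORS = (
    "SSL handshake failure",
    "Timeout during SSL handshake",
    "Connection closed during SSL handshake",
)
_PATTERN = re.compile(
    "|".join(re.escape(message) for message in HANDSHAKE_ERRORS),
    re.IGNORECASE,
)


def count_handshake_errors(lines) -> Counter:
    """Count log lines per handshake error message (keys are lowercased)."""
    counts = Counter()
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            counts[match.group(0).lower()] += 1
    return counts
```

It accepts any iterable of strings, so an open log file can be passed in
directly.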
The problem is similar to the one I've found in the archives from about
2 weeks ago (http://marc.info/?l=haproxy&m=137158875803495&w=2), but
unfortunately I'm unable to debug this. I'm trying to clarify whether
these are normal errors that I just didn't see on ELB, or whether
there's anything I can do to better configure HAProxy. As far as I can
see in the logs, some hosts are able to connect successfully sometimes,
and with errors other times. Hosts that have errors tend to have more
errors than successful requests. Also, almost all of the devices
accessing this service are Android and iOS devices.
I'm using a free StartSSL certificate.
I've posted the relevant haproxy.cfg lines below. Any ideas are
extremely welcome!
defaults
    option accept-invalid-http-request
    option httplog
    log global
    mode http
    option http-server-close
    option redispatch
    timeout connect 60000ms
    timeout client 60000ms
    timeout server 60000ms
frontend www_secure
    mode http
    bind 0.0.0.0:443 ssl crt CERTNAME1.pem crt CERTNAME2.pem
    (ACLs directing traffic to 2 backends)
Hi Andrei,
I suspect the issue is linked to the ECDHE cipher used (0xc014).
Could you run a test with the ECDHE ciphers excluded from the available
suites? Re-check whether the error still occurs after adding
ciphers AES:RC4:ALL:!aNULL:!eNULL:!LOW:!EXPORT:!SSLv2:!ECDH
on the bind line.
Regards,
Emeric