On Tue, May 10, 2011 at 6:38 PM, Ger Hobbelt <[email protected]> wrote:
> ....
> Nothing glaringly obvious to me in any of the code snippets. :-( (And,
> Dave, thanks for catching f.u. where I was missing the __LINE__ in my code!)
>
> It's past 0200 hours here so I'd better get some shut-eye, but here's a few
> thoughts to ponder:

I am very thankful for the detailed reply at such a late hour.

> - you mentioned 'satellite link'. Given the wickedness of the issue, a few
> baseline checks to get our assumptions straight: does your QA rig simulate
> the 'long fat pipe' that's usual for satellite comms? (or does it not?)
> (long fat pipe ~ long round trip delay, high RX/TX rate, but I've also seen
> extremely high RX/TX ratios (way back when: both sides had different uplink
> bandwidth then; kinda like ADSL but much worse). All this /should/'ve been
> handled at the TCP level and not affect any upper layers (like SSL/sockets),
> but better to ask now than ride on an incorrect assumption.
>
> - the above implies the assumption that you're using SSL over TCP, not DTLS
> (which, IIRC, is UDP-based. Correct me if I'm wrong on that one.)

Our QA setup does not simulate the satellite connection; we did test a single
connection over a satellite link, but that was a quick sanity test only. We
did place a large set of clients behind a NIST Net box
(http://www-x.antd.nist.gov/nistnet/index.html), but we simulated latency
only, so the RTT was increased to 500 ms each way. You are correct in that we
are using SSL over TCP :)

> - given a grep through the code, all I can find as 'probable culprits' are
> internal rbuf (read buffer) related and nothing that can be more or less
> directly tickled by SSL_write/read calls. Mix that with my own experience
> of no trouble whatsoever in the RX/TX buffer dept. for years, plus an error
> which hints at an impending buffer overrun happening at yours (and then,
> only in production. Nasty bugger to tackle), and my next bet is that it's
> something related to the 'satellite comms behaviour' as visible at the
> socket I/F level. (I was also thinking about packet size and thus
> ever-so-slightly altered recv/send behaviour, but I've been using OpenSSL
> over PPP, raw serial, Ethernets and a few network oddities and never
> encountered this. :-S )
>
> - You said you had to 'rollback': is that to a state where you use an older
> OpenSSL stack, a different SSL stack or no SSL at all? (Just another
> baseline check here.)

Currently we have several different flavors and versions of Linux (Red Hat 7,
Red Hat 3, CentOS, etc ...) out there running Java 1.X and some Sun software
:(. They make an SSL-over-TCP connection to Java 1.5 Sun server software at
our data center. Because of a compound of issues we decided to place a proxy
server written in Python :) between them. This Python proxy server uses SSL in
non-blocking mode, very similar to the Tornado web server released by
Facebook. I would say about 2/3 of the connections come in over a satellite
network and the rest are from broadband, so we have a great mixture of
technology ;)

The "upgrade" was adding the proxy in the middle and the revert was removing
it. This issue may still be happening now, but the Java server logs are
useless .... can you tell how I feel about Java :)
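To give an idea of the kind of non-blocking handling involved: the receive
path is Tornado-style, roughly like the sketch below (simplified, not our
actual proxy code; the helper name is made up):

    import errno
    import socket
    import ssl

    def read_from_ssl(sock):
        """Read from a non-blocking SSLSocket.

        Returns the data read, '' on a clean close, or None if the
        operation would block and should be retried once select()/epoll
        reports the socket ready again.
        """
        try:
            return sock.read(16384)
        except ssl.SSLError as e:
            if e.args[0] == ssl.SSL_ERROR_WANT_READ:
                return None   # retry when the socket is readable again
            if e.args[0] == ssl.SSL_ERROR_WANT_WRITE:
                return None   # renegotiation: retry when writable
            if e.args[0] == ssl.SSL_ERROR_EOF:
                return ''     # peer vanished without a close_notify
            raise             # anything else is a real error
        except socket.error as e:
            if e.args[0] in (errno.EWOULDBLOCK, errno.EAGAIN):
                return None
            raise

The WANT_READ / WANT_WRITE retry paths get exercised constantly on the
high-latency links, which is part of why I want to go over the Python SSL
module again.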
> - the only thing that I can see which /might/ have impact (and this is mere
> conjecture! But we're apparently looking at some sort of obscure edge case,
> so options are open in my mind) is when both ends differ in their OpenSSL
> builds, where one has OpenSSL compiled _with_ compression enabled, and the
> other has not (OPENSSL_NO_COMP).
> Line of thought goes like this: the thing that possibly maybe might screw
> you up is when the other side sends packets which don't somehow fit in the
> RX buffer, the SSL stack keeps on asking ssl3_read_n() to fetch ever more,
> and the SSL packet is not caught as overlarge in other parts of the code
> (which would be odd by itself, but let's continue). So the question is then:
> how is the RX buffer dimensioned? It's set up in either ssl3_setup_buffers()
> or ssl3_setup_read_buffer() and its space depends on the
> SSL3_RT_MAX_PACKET_SIZE define. Which, when followed through, is only really
> dependent on either the OPENSSL_NO_COMP setting or someone seriously
> screwing around in the OpenSSL code internals. So I opt for the 'no_comp'
> setting being a 'possibly maybe', however improbable it may be. (Because
> I've used that mix before with no trouble; besides, one can expect both
> flavors to exist in the wild, so why is this a first then?)
>
> Another thing that affects packet size in OpenSSL is the
> SSL_OP_MICROSOFT_BIG_SSLV3_BUFFER option, which you can set, for example,
> using the SSL_ctrl(s, SSL_CTRL_OPTIONS, SSL_OP_MICROSOFT_BIG_SSLV3_BUFFER,
> NULL) call (or the equivalent for the SSL_CTX: SSL->options are copied from
> the SSL_CTX). When one side has the option and the other has not, it just
> might..... (Again, with low probability, but these are the ways I can see
> that things /could/ go wrong as you described. Without a system where this
> can be reproduced, we're down to 'intelligent' guesswork and a couple of
> straws.)
>
> Cheers and good night!

Again, thanks for the reply at such a late hour and for all the help so far.
Unless you have any more ideas in the morning, I will attempt the following:

* Put our test clients behind a simulator that more accurately reflects a
  satellite network
* Review our code to ensure we handle this issue correctly, like closing the
  socket cleanly (a rough sketch of what I have in mind is at the bottom of
  this mail)
* Review the Python SSL module more closely to see how it may be causing this
  issue

Michael

> --
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web:    http://www.hobbelt.com/
>         http://www.hebbut.net/
> mail:   [email protected]
> mobile: +31-6-11 120 978
> --------------------------------------------------

--
Ecclesiastes 1:18
18 For with much wisdom comes much sorrow; the more knowledge, the more grief.
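P.S. For the "closing the socket" item above, this is roughly what I have in
mind (sketch only; the helper name is made up and this is not our actual
proxy code):

    import socket
    import ssl

    def close_ssl_socket(sock):
        """Best-effort teardown of a (possibly half-dead) SSL connection."""
        try:
            # unwrap() runs the SSL shutdown (close_notify) and hands back
            # the plain socket underneath.
            sock = sock.unwrap()
        except (ssl.SSLError, socket.error):
            # Shutdown did not complete (peer gone, link dropped, or it
            # would block on a non-blocking socket); just release the fd.
            pass
        try:
            sock.shutdown(socket.SHUT_RDWR)
        except socket.error:
            pass
        sock.close()

The point is mainly to make sure the proxy never leaks the file descriptor,
even when the SSL-level shutdown itself fails.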
