When building tcnative against OpenSSL 1.1.1, I ran into hangs when talking TLS 1.0 to Tomcat trunk using that tcnative build plus Nio(2).

A simple "GET /" request, e.g. sent with curl, hangs for 60 seconds after a successful TLS handshake; the client then ends with an "empty reply from server".

You can also reproduce this with openssl s_client. The request hangs until you send one more empty line (in addition to the usual empty line ending the request). The extra line then triggers another read, which finds the old request data and handles it.

The problem does not occur with the APR connector. APR and Nio(2) seem to use very different code paths in tcnative for TLS handling (sslnetwork.c versus ssl.c).

I have some understanding of the root cause but currently no good idea how to fix it. The root cause is incorrect handling of SSL_read() when it returns "0". The OpenSSL man page has a relevant description at [1]. As also observed in mod_ssl (Apache web server), OpenSSL 1.1.1 behaves differently from older versions in that it can return "0" where old versions returned "-1". That was always documented as a possibility, but now it really happens in practice. The tcnative code used by APR handles this in the native part. The code used by Nio(2) simply returns the value it gets from SSL_read() and leaves it to the calling Java code to handle. netty, from which we borrowed the ideas for Java plus OpenSSL, does include such handling in ReferenceCountedOpenSslEngine.java, especially for SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE.

I could have experimented with their approach, but there seems to be another problem that makes this harder. The relevant call to SSL_read() returns "0", but a following SSL_get_error() does not return WANT_READ or WANT_WRITE; instead it returns "5", which is SSL_ERROR_SYSCALL. I do not have a good idea where this comes from. When tracing system calls, it seems to come from an EAGAIN in a socket read, but I am not sure about that.
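To make the case analysis above concrete, here is a minimal sketch in plain Java of how a caller might classify the pair (SSL_read() result, SSL_get_error() code). The SSL_ERROR_* values are the ones from OpenSSL's ssl.h. Everything else is hypothetical: the class name, the Action enum and the wouldBlock flag (standing in for "errno was EAGAIN") are made up for illustration, and treating SSL_ERROR_SYSCALL plus EAGAIN as "retry" is only an assumption based on the unconfirmed observation above, not a verified fix.

```java
public final class SslReadOutcome {
    // SSL_ERROR_* values as defined in OpenSSL's ssl.h.
    static final int SSL_ERROR_NONE = 0;
    static final int SSL_ERROR_SSL = 1;
    static final int SSL_ERROR_WANT_READ = 2;
    static final int SSL_ERROR_WANT_WRITE = 3;
    static final int SSL_ERROR_SYSCALL = 5;
    static final int SSL_ERROR_ZERO_RETURN = 6;

    /** Hypothetical set of reactions a caller could choose from. */
    enum Action { DATA, RETRY_ON_READ, RETRY_ON_WRITE, CLOSED, FAIL }

    /**
     * @param readResult value returned by SSL_read()
     * @param sslError   value returned by the following SSL_get_error()
     * @param wouldBlock hypothetical flag: the underlying socket read hit EAGAIN
     */
    static Action classify(int readResult, int sslError, boolean wouldBlock) {
        if (readResult > 0) {
            return Action.DATA; // plaintext bytes were delivered
        }
        switch (sslError) {
            case SSL_ERROR_WANT_READ:
                return Action.RETRY_ON_READ;
            case SSL_ERROR_WANT_WRITE:
                return Action.RETRY_ON_WRITE;
            case SSL_ERROR_ZERO_RETURN:
                return Action.CLOSED; // peer sent close_notify
            case SSL_ERROR_SYSCALL:
                // Assumption: an EAGAIN behind SYSCALL is not fatal here.
                return wouldBlock ? Action.RETRY_ON_READ : Action.FAIL;
            default:
                return Action.FAIL;
        }
    }

    public static void main(String[] args) {
        // The case described above: SSL_read() == 0, SSL_get_error() == 5.
        System.out.println(classify(0, SSL_ERROR_SYSCALL, true));
        System.out.println(classify(0, SSL_ERROR_SYSCALL, false));
    }
}
```

The point of the sketch is only that a "0" from SSL_read() cannot be interpreted without also looking at SSL_get_error(), and in the SYSCALL case probably errno as well.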

In our Java code, what happens is a call to unwrap() in OpenSSLEngine. This call writes, I think, 146 bytes, then checks pendingReadableBytesInSSL(). That call in turn calls SSL.readFromSSL() and gets back "0" (from SSL_read()). Back in unwrap() we then skip the while loop and finally return with BUFFER_UNDERFLOW. Then we hang, probably because the data has already been read by OpenSSL and no further socket event happens. If I artificially add another call to pendingReadableBytesInSSL(), which triggers another SSL_read(), the hang does not occur.
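The sequence above can be modelled with a small stand-alone Java toy. It does not use the real OpenSSLEngine or SSL.readFromSSL(); ScriptedSsl is a hypothetical stand-in that replays scripted return values, and the 18 bytes in the script are an arbitrary number chosen for illustration. It only shows why a single "0" followed by giving up loses data that a second read would have found.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public final class UnwrapHangModel {
    /** Hypothetical stand-in for the native read: replays scripted results. */
    static final class ScriptedSsl {
        private final Deque<Integer> results = new ArrayDeque<>();

        ScriptedSsl(int... scripted) {
            for (int r : scripted) {
                results.add(r);
            }
        }

        /** Returns the next scripted value, or 0 once the script is exhausted. */
        int read() {
            return results.isEmpty() ? 0 : results.poll();
        }
    }

    /** Models the current behaviour: one read, and "0" means nothing pending. */
    static int pendingOnce(ScriptedSsl ssl) {
        return Math.max(ssl.read(), 0);
    }

    /** Models the experiment: read again when the first attempt returned 0. */
    static int pendingWithRetry(ScriptedSsl ssl, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            int n = ssl.read();
            if (n > 0) {
                return n;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // First read returns 0, second read would return the buffered plaintext.
        System.out.println(pendingOnce(new ScriptedSsl(0, 18)));         // prints 0
        System.out.println(pendingWithRetry(new ScriptedSsl(0, 18), 2)); // prints 18
    }
}
```

A blind retry like this is only a way to demonstrate the effect, not a proposed fix; a real change would presumably decide based on SSL_get_error() rather than on a fixed number of attempts.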

IMHO TLS 1.0 is not such a big problem, but we should at least document it when we do a new release.

I might dig deeper into the native layer, checking errno etc., but I am not sure I will find the time.

[1]: https://www.openssl.org/docs/man1.1.1/man3/SSL_read.html
