Hi all, I want to gather statistics on how many failed TLS handshakes are happening across a large gRPC deployment. (The motivation is a long story, but basically I want to be able to trigger alerts if something goes wrong with the PKI that's generating the client and server certificates.) Just a count of failed TLS handshakes would be sufficient, although more detailed errors might be helpful if I can get at them. I see how to do this in the Go gRPC code, but I also need the same support for Python.
I'd mildly prefer to capture this information on the server, although it may be adequate to capture it on the client if that's easier to do. (I only need either the server or the client, not both.)

Could someone point me in the right direction? I can take a cut at implementing this as a general feature suitable for a pull request, but I'm not sure the best approach to use given the code structure.

What I've been able to determine (I think) so far:

* Actual success or failure of the TLS handshakes is determined (for both client and server) in src/core/tsi/ssl_transport_security.cc.

* For the server, these errors pass up to on_handshake_done in src/core/ext/transport/chttp2/server/chttp2_server.cc, where they're logged and discarded and the channel is closed. This seems to be below the layer where the Python bindings have any visibility. (In other words, so far as I could determine, no Python code ever sees the error from an attempted connection that results in a failed TLS handshake.)

* For the client, it looks like (?) these errors will trigger a notify closure in src/core/ext/transport/chttp2/client/chttp2_connector.cc in on_handshake_done, which in turn seems to be on_subchannel_connected in core/ext/filters/client_channel/subchannel.cc, but it looks like the exact contents of the error don't go any farther beyond that function.

Therefore, unless I'm mistaken, it looks like these errors are swallowed inside the core code in places where I can't get visibility to them from Python, hiding entirely (except for a log message) in the server and turning into a generic connection state inside the client channels. So there seems to be some plumbing or a hook missing here to be able to bubble these failures up to a level where I can get at them and send them to monitoring code.

-- 
Russ Allbery <http://www.eyrie.org/~eagle/>
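
P.S. For reference, the closest approximation I can see from Python today is to watch the client channel's connectivity state. Here is a minimal sketch of that, assuming a failed handshake shows up only as a transition into TRANSIENT_FAILURE, so the count also includes refused connections, DNS failures, and the like. ConnectivityFailureCounter and watch are hypothetical names of mine; Channel.subscribe is the only gRPC call involved.

```python
import threading
from collections import Counter


class ConnectivityFailureCounter:
    """Counts distinct transitions of a channel into each connectivity state.

    Assumption: a failed TLS handshake is visible from Python only as the
    generic TRANSIENT_FAILURE state, so this is an upper bound on handshake
    failures, not an exact count.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._last = None
        self.transitions = Counter()

    def __call__(self, state):
        # grpc.ChannelConnectivity members are enums; reading .name keeps
        # this class usable (and testable) without importing grpc.
        name = getattr(state, "name", str(state))
        with self._lock:
            # Only count a change of state, not repeated reports of it.
            if name != self._last:
                self.transitions[name] += 1
                self._last = name

    @property
    def failures(self):
        with self._lock:
            return self.transitions["TRANSIENT_FAILURE"]


def watch(channel, counter):
    """Subscribe the counter to a grpc.Channel and start connecting."""
    channel.subscribe(counter, try_to_connect=True)
    return counter
```

With a real channel this would be used as watch(grpc.secure_channel(target, credentials), ConnectivityFailureCounter()), and counter.failures could then be exported to monitoring. It says nothing about why the connection failed, which is exactly the missing piece.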
