[grpc-io] Gathering statistics about TLS failure (Python)

Russ Allbery Thu, 24 May 2018 19:32:38 -0700

Hi all,

I want to gather statistics on how many failed TLS handshakes are
happening across a large gRPC deployment.  (The motivation is a long
story, but basically I want to be able to trigger alerts if something goes
wrong with the PKI that's generating the client and server certificates.)
Just a count of failed TLS handshakes would be sufficient, although more
detailed errors might be helpful if I can get at them.  I see how to do
this in the Go gRPC code, but I also need the same support for Python.


I'd mildly prefer to capture this information on the server, although it
may be adequate to capture it on the client if that's easier to do.  (I
only need either the server or the client, not both.)

Could someone point me in the right direction?  I can take a cut at
implementing this as a general feature suitable for a pull request, but
I'm not sure of the best approach given the code structure.

What I've been able to determine (I think) so far:

* Actual success or failure of the TLS handshakes is determined (for both
  client and server) in src/core/tsi/ssl_transport_security.cc.

* For the server, these errors pass up to on_handshake_done in
  src/core/ext/transport/chttp2/server/chttp2_server.cc, where they're
  logged and discarded and the channel is closed.  This seems to be below
  the layer where the Python bindings have any visibility.  (In other
  words, so far as I could determine, no Python code ever sees the error
  from an attempted connection that results in a failed TLS handshake.)

* For the client, it looks like (?) these errors trigger a notify closure
  from on_handshake_done in
  src/core/ext/transport/chttp2/client/chttp2_connector.cc.  That closure
  seems to be on_subchannel_connected in
  core/ext/filters/client_channel/subchannel.cc, but the exact contents of
  the error don't appear to go any further than that function.

Therefore, unless I'm mistaken, it looks like these errors are swallowed
inside the core code in places where I can't get visibility into them from
Python: hidden entirely (except for a log message) on the server, and
collapsed into a generic connection state inside the client channels.  So
there seems to be some plumbing or a hook missing here that would let me
bubble these failures up to a level where I can get at them and send them
to monitoring code.
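
To make the client-side limitation concrete, here is a minimal sketch of
what seems to be reachable from Python today: counting connectivity state
changes through the public Channel.subscribe() callback.  The class and
function names (ConnectivityFailureCounter, watch_channel) are hypothetical
and only for illustration.  Note that TRANSIENT_FAILURE covers any failed
connection attempt, not only a failed TLS handshake, so this is at best a
coarse proxy and not the per-handshake error count I'm after.

```python
import collections
import threading


class ConnectivityFailureCounter:
    """Counts channel connectivity transitions, keyed by state name.

    Hypothetical helper: it only relies on the state object having a
    .name attribute, which grpc.ChannelConnectivity enum members do.
    """

    def __init__(self):
        # The callback runs on a gRPC-managed thread, so guard the counts.
        self._lock = threading.Lock()
        self._counts = collections.Counter()

    def __call__(self, state):
        with self._lock:
            self._counts[state.name] += 1

    def failures(self):
        # TRANSIENT_FAILURE is the generic "connection attempt failed"
        # state; the underlying TLS error is not exposed here.
        with self._lock:
            return self._counts["TRANSIENT_FAILURE"]

    def snapshot(self):
        with self._lock:
            return dict(self._counts)


def watch_channel(target, credentials, counter):
    """Open a secure channel and report its state changes to counter."""
    import grpc  # imported here so the counter is usable without grpcio

    channel = grpc.secure_channel(target, credentials)
    # try_to_connect=True asks the channel to connect immediately rather
    # than waiting for the first RPC.
    channel.subscribe(counter, try_to_connect=True)
    return channel
```

With something like this, failures() could be exported to monitoring
periodically, but it can't distinguish a bad certificate from a refused
connection, which is why a hook closer to the handshake seems necessary.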

-- 
Russ Allbery ([email protected])              <http://www.eyrie.org/~eagle/>
