A structure occasionally had an uninitialized boolean member that was copied directly into the reply message. The Undefined Behavior Sanitizer (libubsan) found it for us.
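For anyone who hits something similar, the pattern was essentially the following (the struct and field names here are made up for illustration, not our actual code). Reading the never-initialized bool is undefined behavior, and since that value was being copied into the streaming reply, garbage could end up on the wire, which is consistent with the malformed-looking fields in the capture:

    #include <cstdint>
    #include <cstdio>

    struct Stats {
      uint32_t packets;
      bool ready;  // never assigned on some code paths
    };

    int main() {
      Stats s;
      s.packets = 42;
      // Reading s.ready is undefined behavior. In the real code the value was
      // passed to the reply's setter, so the stale byte went into the message.
      std::printf("ready=%d\n", s.ready ? 1 : 0);
      return 0;
    }

Building with -fsanitize=undefined (which includes the bool check, -fsanitize=bool) makes libubsan report "load of value N, which is not a valid value for type 'bool'" at the offending read, at least whenever the stale byte happens to be neither 0 nor 1.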
On Wednesday, March 24, 2021 at 2:23:04 PM UTC-4 [email protected] wrote:

> The deserialization happens at the surface layer instead of the transport
> layer, unless we suspect that the HTTP/2 frames themselves were malformed.
> If we suspect the serialization/deserialization code, we can check whether
> simply serializing the proto to bytes and back is causing issues. Protobuf
> has utility functions to do this. Alternatively, gRPC has utility functions
> here:
> https://github.com/grpc/grpc/blob/master/include/grpcpp/impl/codegen/proto_utils.h
>
> I am worried about memory corruption, though, so that is certainly something
> to check.
>
> On Wednesday, March 24, 2021 at 11:02:30 AM UTC-7 Bryan Schwerer wrote:
>
>> Thanks for replying.
>>
>> I was able to get a tcpdump capture and run it through the Wireshark
>> dissector. It indicated that there were malformed protobuf fields in the
>> message. I'm guessing the client threw the messages away; I didn't see a
>> trace message indicating that. Is there some sort of stat I can check?
>> Would it be possible that older versions didn't discard malformed messages?
>> I haven't loaded up an old version of our code, but I suspect the problem
>> has always been there. The end of the message has counters and such that,
>> if they were a bit off, no one would notice.
>>
>> I think we are corrupting the messages on the server side: I turned on
>> -fstack-protector-all and the problem went away. If there is a way to
>> check the message before sending it to the Writer, that may give us more
>> information. We don't use arenas. The message itself is uint32s, bools,
>> and one string. I assume protobuf makes a copy of the string and not the
>> pointer to the buffer.
>>
>> On Wednesday, March 24, 2021 at 1:35:29 PM UTC-4 [email protected] wrote:
>>
>>> This is pretty strange. It is possible that we are being blocked on flow
>>> control. I would check that the application layer is actually reading. If
>>> I am not mistaken, `perform_stream_op[s=0x7f0e16937290]: RECV_MESSAGE` is
>>> a log that is seen at the start of an operation, meaning that the HTTP/2
>>> layer hasn't yet been instructed to read a message (or there is a previous
>>> read on the stream that hasn't finished). Given that you are just updating
>>> the gRPC version from 1.20 to 1.36.1, I do not have an answer as to why
>>> you would see this without any application changes.
>>>
>>> A few questions:
>>> Do the two streams use the same underlying channel/transport?
>>> Are the clients and the server in the same process?
>>> Is there anything special about the environment this is being run in?
>>>
>>> (One way to make sure that the read op is being propagated to the
>>> transport layer is to check the logs with the "channel" tracer.)
>>>
>>> On Friday, March 19, 2021 at 12:59:30 PM UTC-7 Bryan Schwerer wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm in the long overdue process of updating gRPC from 1.20 to 1.36.1.
>>>> I am running into an issue where the streaming replies from the server
>>>> are not reaching the client in about 50% of the instances. This is
>>>> binary: either the streaming call works perfectly or it doesn't work at
>>>> all. After debugging a bit, I turned on the http tracing, and from what
>>>> I can tell the http messages are received in the client thread; in the
>>>> working case, perform_stream_op[s=0x7f0e16937290]: RECV_MESSAGE is
>>>> logged, but in the broken case it isn't. No error messages occur.
>>>>
>>>> I've tried various tracers, but haven't hit anything. The code follows
>>>> pretty much the same pattern as the example, and there's no indication
>>>> that any disconnect has occurred which would cause the call to terminate.
>>>> Using gdb to look at the thread, it is still in epoll_wait.
>>>>
>>>> The process in which this runs makes 2 different synchronous server
>>>> streaming calls to the same server in separate threads. It is also a
>>>> gRPC server. Everything is run over the internal 'lo' interface. Any
>>>> ideas on where to look to debug this?
>>>>
>>>> Thanks,
>>>>
>>>> Bryan
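Following up on the serialize-and-parse suggestion above, here is a minimal sketch of that kind of sanity check, run just before handing a message to the Writer. StatsReply and the generated header name are stand-ins for whatever message type the stream actually returns:

    #include <string>
    #include "stats.pb.h"  // hypothetical generated header defining StatsReply

    // Returns true if the message survives a bytes round trip unchanged.
    bool RoundTripsCleanly(const StatsReply& msg) {
      std::string bytes;
      if (!msg.SerializeToString(&bytes)) return false;    // serialization failed
      StatsReply reparsed;
      if (!reparsed.ParseFromString(bytes)) return false;  // bytes don't parse back
      return reparsed.SerializeAsString() == bytes;        // round trip is stable
    }

    // In the server streaming handler, before writer->Write(reply):
    //   if (!RoundTripsCleanly(reply)) { /* log the offending message */ }

A failure here points at the message contents (uninitialized fields, memory corruption on the server side) rather than at the HTTP/2 transport.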

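For the tracer suggestions earlier in the thread, the http and channel tracers can be enabled through gRPC's standard environment variables when launching the affected process, for example (binary names are placeholders):

    # Enable the http and channel tracers with debug-level logging
    # (tracer names are listed in grpc/doc/environment_variables.md).
    GRPC_VERBOSITY=DEBUG GRPC_TRACE=http,channel ./your_server_binary

    # "all" enables every tracer, which is very noisy but sometimes useful.
    GRPC_VERBOSITY=DEBUG GRPC_TRACE=all ./your_client_binary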