Thanks for replying. I was able to get a tcpdump capture and run it through the Wireshark dissector, which flagged malformed protobuf fields in the message. I'm guessing the client threw those messages away, though I didn't see a trace message indicating that. Is there some sort of stat I can check? Is it possible that older versions didn't discard malformed messages? I haven't loaded up an old version of our code, but I suspect the corruption has always been there: the end of the message is counters and the like, so if they were a bit off, no one would notice.
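In case it's useful to anyone, this is roughly how I pulled the flagged frames out of the capture with tshark (capture.pcap stands in for the actual file; _ws.malformed is Wireshark's display filter for frames a dissector flagged as malformed):

    # list only the frames Wireshark flagged as malformed
    tshark -r capture.pcap -Y "_ws.malformed"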
I think we are corrupting the messages on the server side: I turned on -fstack-protector-all and the problem went away. If there's a way to check the message before handing it to the Writer, that may give us more information (a rough sketch of what I mean is at the bottom of this mail). We don't use arenas. The message itself is uint32s, bools, and one string. I assume protobuf makes a copy of the string rather than keeping a pointer to the buffer.

On Wednesday, March 24, 2021 at 1:35:29 PM UTC-4 [email protected] wrote:

> This is pretty strange. It is possible that we are being blocked on flow
> control. I would check that we are making sure that the application layer
> is reading. If I am not mistaken, `perform_stream_op[s=0x7f0e16937290]:
> RECV_MESSAGE` is a log that is seen at the start of an operation, meaning
> that the HTTP/2 layer hasn't yet been instructed to read a message (or
> there is a previous read on the stream already that hasn't finished). Given
> that you are just updating the gRPC version from 1.20 to 1.36.1, I do not
> have an answer as to why you would see this without any application
> changes.
>
> A few questions -
> Do the two streams use the same underlying channel/transport?
> Are the clients and the server in the same process?
> Is there anything special about the environment this is being run in?
>
> (One way to make sure that the read op is being propagated to the
> transport layer is to check the logs with the "channel" tracer.)
>
> On Friday, March 19, 2021 at 12:59:30 PM UTC-7 Bryan Schwerer wrote:
>
>> Hello,
>>
>> I'm in the long overdue process of updating gRPC from 1.20 to 1.36.1. I
>> am running into an issue where the streaming replies from the server are
>> not reaching the client in about 50% of the instances. This is binary:
>> either the streaming call works perfectly or it doesn't work at all.
>> After debugging a bit, I turned on the http tracing, and from what I can
>> tell, the http messages are received in the client thread; in the working
>> case, perform_stream_op[s=0x7f0e16937290]: RECV_MESSAGE is logged, but in
>> the broken case it isn't. No error messages occur.
>>
>> I've tried various tracers, but haven't hit anything. The code is pretty
>> much the same pattern as the example, and there's no indication any
>> disconnect has occurred which would cause the call to terminate. Using
>> gdb to look at the thread, it is still in epoll_wait.
>>
>> The process in which this runs makes 2 different synchronous server
>> streaming calls to the same server in separate threads. It is also a gRPC
>> server itself. Everything is run over the internal 'lo' interface. Any
>> ideas on where to look to debug this?
>>
>> Thanks,
>>
>> Bryan
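To be concrete about the pre-send check mentioned above, something like this is what I have in mind: serialize the message and parse the bytes back before handing it to the Writer. If the round trip fails, the corruption happened on our side before gRPC ever saw the message. CheckedWrite, MyReply, and the generated header name are placeholders, not our real code:

    #include <iostream>
    #include <string>
    #include <grpcpp/grpcpp.h>
    #include "my_service.grpc.pb.h"  // placeholder for the generated header

    // Round-trip the message through its own wire format before writing it.
    bool CheckedWrite(grpc::ServerWriter<MyReply>* writer, const MyReply& msg) {
      std::string wire;
      if (!msg.SerializeToString(&wire)) {
        std::cerr << "serialize failed: " << msg.InitializationErrorString()
                  << "\n";
        return false;
      }
      MyReply reparsed;
      if (!reparsed.ParseFromString(wire)) {
        // Bytes don't parse back: message was corrupt before the send,
        // which would fit a stack overwrite on the server side.
        std::cerr << "message corrupt before send\n";
        return false;
      }
      return writer->Write(msg);
    }

That at least separates "corrupted before serialization" from "corrupted somewhere on the wire".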

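P.S. For anyone who finds this thread later: the "channel" tracer suggested above is enabled through environment variables on the process, e.g. (binary name is made up):

    GRPC_TRACE=channel GRPC_VERBOSITY=DEBUG ./my_server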