The line does look a bit suspicious to me: it means the combiner has 30997 closures queued, which is a lot (the least significant bit in "last" is ignored; the actual count is stored in the remaining bits).
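As a quick aside, here is a minimal sketch of that decoding - not gRPC code, just an illustration; the value 61995 is taken from the trace line quoted just below:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative decoding of the combiner's "last" value as described
     * above: the least significant bit is a flag and is ignored, and the
     * number of queued closures lives in the remaining bits. */
    int main(void) {
      uintptr_t last = 61995;       /* from the combiner.c trace line below */
      uintptr_t count = last >> 1;  /* drop the least significant bit */
      printf("queued closures = %lu\n", (unsigned long)count);  /* prints 30997 */
      return 0;
    }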
D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995

Not sure why we ran into this situation. Putting that aside for now, based on the logs (in the failed case), the threads are waiting for the batch to complete:

I0919 14:54:36.312170818 51030 call.c:2069] grpc_call_start_batch(call=0x7f4c22edb020, ops=0x7f550e39fde0, nops=4, tag= *0x7f4c22edb020*, reserved=(nil))
I0919 14:54:36.312199708 51030 call.c:1677] ops[0]: SEND_INITIAL_METADATA ..
I0919 14:54:36.312210605 51030 call.c:1677] ops[1]: RECV_INITIAL_METADATA ptr=0x7f4c22e9e510
I0919 14:54:36.312214004 51030 call.c:1677] ops[2]: SEND_MESSAGE ptr=0x7f4c262df200
I0919 14:54:36.312216804 51030 call.c:1677] ops[3]: SEND_CLOSE_FROM_CLIENT ..
D0919 14:54:36.312362846 51030 completion_queue.c:812] PENDING TAGS: *0x7f4c22edb020*

The real thing to look at here is why the call isn't making any progress (I am assuming the sends are successful but the receives aren't coming in). In other words, I think the threads are waiting in epoll_wait and cv_wait because there is nothing incoming on the fds.

1) Could you turn on the "tcp" traces as well? i.e. export GRPC_TRACE="tcp,api,call_combiner,op_failure,client_channel"

2) Could you sync to the latest HEAD? We fixed a few bugs in the timers and the polling engine very recently, and I want to make sure you are on the latest HEAD. Based on the line numbers in the call stacks, it looks like your branch is a few days behind.

thanks,
Sree

PS:
>> Hmm so it looks like the reason why grpc_combiner_continue_exec_ctx doesn't get called is because exec_ctx->active_combiner is actually NULL and so the function just returns false. Not yet sure why / how that became NULL

This happens when we have executed all the closures on the combiner, so this may be a red herring (maybe not, but it would help to look at the tcp traces first).

On Tue, Sep 19, 2017 at 4:58 PM, ryan via grpc.io <[email protected]> wrote:
>
> Hmm so it looks like the reason why grpc_combiner_continue_exec_ctx doesn't get called is because exec_ctx->active_combiner is actually NULL and so the function just returns false. Not yet sure why / how that became NULL
>
> On Tuesday, September 19, 2017 at 9:55:10 AM UTC-7, [email protected] wrote:
>>
>> Err... Actually I just realized that my source files had changed so that stack isn't entirely accurate. Here is a proper stack trace:
>>
>>> #0  0x00007f6ddb4a1a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>> #1  0x00007f6dcbf4471c in gpr_cv_wait (cv=0x7f6db13f8a90, mu=0x7f6a65b9cd90, abs_deadline=...) at src/core/lib/support/sync_posix.c:80
>>> #2  0x00007f6dcbe9bea8 in begin_worker (pollset=0x7f6a65b9cd90, worker=0x7f6db13f8a70, worker_hdl=0x0, now=0x7f6db13f8a30, deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:747
>>> #3  0x00007f6dcbe9c7c9 in pollset_work (exec_ctx=0x7f6db13f8b80, ps=0x7f6a65b9cd90, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:941
>>> #4  0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6db13f8b80, pollset=0x7f6a65b9cd90, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
>>> #5  0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
>>> #6  0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
>>> #7  0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a9288a670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583
>>
>> and
>>
>>> #0  0x00007f6dd56b6d43 in epoll_wait () from /lib64/libc.so.6
>>> #1  0x00007f6dcbe9b9e5 in do_epoll_wait (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:640
>>> #2  0x00007f6dcbe9c972 in pollset_work (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:964
>>> #3  0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6d99ff8b80, pollset=0x7f6a65867710, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
>>> #4  0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
>>> #5  0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
>>> #6  0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a624e0670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583
>>
>> These are both proper examples of it just timing out on epoll_wait/pthread_cond_timedwait.
>>
>> The other thing I've noticed is on the bad request:
>>
>>> D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 *last=61995*
>>
>> last is 61995, which seems to indicate the number of scheduled items in the queue. In a normal request this number is kept low.
>>
>> On Tuesday, September 19, 2017 at 9:19:11 AM UTC-7, Michael Lumish wrote:
>>>
>>> It looks like your stack trace includes a call to grpc_server_request_registered_call with a NULL server. That is suspicious.
>>>
>>> On Tue, Sep 19, 2017 at 8:41 AM ryan via grpc.io <[email protected]> wrote:
>>>>
>>>> Hello all -
>>>>
>>>> I've been developing a gRPC-HHVM extension <https://github.com/grpc/grpc/pull/11553>. In the process of stress testing this extension we've consistently seen a condition where gRPC gets into a state where any request that is made gets indefinitely stuck on the epoll_wait syscall (or times out if a timeout is set on grpc_completion_queue_next). Everything works fine and is chugging along, and then this problem happens and all of the requests get stuck or time out.
>>>> This is what it looks like in gdb:
>>>>
>>>> Thread 21 (Thread 0x7f54f6bff700 (LWP 51033)):
>>>> #0  0x00007f554ad90d43 in epoll_wait () from /lib64/libc.so.6  <-- indefinitely stuck, or times out if one is set
>>>> #1  0x00007f55416969e5 in add_poll_object (exec_ctx=0x2482ad88, bag=0x188924, bag_type=(POLL_OBJ_POLLSET | unknown: 32596), item=0x0, item_type=(POLL_OBJ_POLLSET | unknown: 1097427428)) at src/core/lib/iomgr/ev_epollsig_linux.c:1484
>>>> #2  0x00007f5541697972 in fd_ref (fd=0x4, reason=0x33b8b87c0 <Address 0x33b8b87c0 out of bounds>, file=0x0, line=0) at src/core/lib/iomgr/ev_poll_posix.c:447
>>>> #3  0x00007f55416a4691 in tcp_drop_uncovered_then_handle_write (exec_ctx=0x7f54f6bf8680, arg=0x7f4c4e2c9010, error=0x0) at src/core/lib/iomgr/tcp_posix.c:239
>>>> #4  0x00007f55416c71a8 in grpc_server_request_registered_call (server=0x0, rmp=0x0, call=0x2482ad88, deadline=0x188924, initial_metadata=0x7f4c4e2c8f00, optional_payload=0x0, cq_bound_to_call=0x4, cq_for_notification=0x33b8b87c0, tag=0x0) at src/core/lib/surface/server.c:1486
>>>> #5  0x00007f55416c75d2 in conforms_to (slice=..., legal_bits=0x7f4c22ce2e50 "\270\064\245\214L\177", err_desc=0x7f54f6bf8701 "\211\030") at src/core/lib/surface/validate_metadata.c:40
>>>> #6  0x00007f5541b97105 in HPHP::c_Call_ni_startBatch (this_=0x7f4c22ce2e60, actions=...) at /home/ryangordon/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:529
>>>>
>>>> The stack trace doesn't always look like that, but it is always stuck on epoll_wait/epoll_pwait, often from gpr_cv_wait.
>>>>
>>>> I've managed to get GRPC_TRACE=all GRPC_VERBOSITY=DEBUG logs of a good request and of a request when it is in the bad state, to compare side by side.
>>>>
>>>> Good Request: https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Not-Frozen%2520Log%2520Snippet
>>>>
>>>> Bad Request: https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Frozen%2520log%2520snippet
>>>>
>>>> The bad request looks like this when its execution deviates:
>>>>
>>>> D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
>>>> D0919 14:54:36.312353011 51030 exec_ctx.c:73] closure 0x7f4c22edbec0 finished
>>>> I0919 14:54:36.312358626 51030 completion_queue.c:834] grpc_completion_queue_next(cq=0x7f4c22e9e200, deadline=gpr_timespec { tv_sec: 4, tv_nsec: 999000000, clock_type: 3 }, reserved=(nil))
>>>> ...
>>>>
>>>> Whereas, the good request does this:
>>>>
>>>> D0915 15:19:50.178995777 13689 combiner.c:163] C:0x7f932172f700 grpc_combiner_execute c=0x7f9313d67fb0 last=1
>>>> D0915 15:19:50.179005880 13689 exec_ctx.c:73] closure 0x7f9313d67ec0 finished
>>>> D0915 15:19:50.179074826 13689 combiner.c:225] C:0x7f932172f700 grpc_combiner_continue_exec_ctx contended=0 exec_ctx_ready_to_finish=1 time_to_execute_final_list=0
>>>> D0915 15:19:50.179077765 13689 combiner.c:243] C:0x7f932172f700 maybe_finish_one n=0x7f9313d67fb0
>>>> ... (and a whole lot more)
>>>>
>>>> Effectively, it seems like something is not getting scheduled when it should be (grpc_combiner_continue_exec_ctx never gets run).
>>>> I've been working on this for a few weeks now and haven't been able to track down the cause or a fix. I have the full logs if they might be helpful. I've been testing on my branch, which is kept pretty close to grpc/grpc master.
>>>>
>>>> Does anyone have any idea what is going on? I have the bad process open on a test server in gdb and have timeouts implemented, so I can run repeated requests at it to continue debugging if necessary.
>>>>
>>>> Thanks,
>>>> Ryan
