The primary cause of your problem is that you're calling completion_queue_pluck after calling cancel_with_status. Assuming you're following the model of the current PHP wrapper and calling pluck with an infinite timeout, you should call pluck exactly once for each operation that creates a tag on the queue. start_batch creates a tag; cancel_with_status does not. So blocking forever is the expected behavior in that situation.
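To make that concrete, the pattern looks roughly like this (a minimal sketch against the core C API; op setup, error handling, and the surrounding wrapper code are elided, and `run_batch`, `ops`, and `nops` are just illustrative names):

    #include <grpc/grpc.h>
    #include <grpc/support/log.h>
    #include <grpc/support/time.h>

    // One start_batch -> exactly one pluck on that batch's tag.
    void run_batch(grpc_call* call, grpc_completion_queue* cq,
                   grpc_op* ops, size_t nops) {
      void* tag = call;  // as in the logs below, the call pointer doubles as the tag

      // start_batch is what places a completion (this tag) on the queue.
      GPR_ASSERT(grpc_call_start_batch(call, ops, nops, tag, nullptr) ==
                 GRPC_CALL_OK);

      // Pluck that tag exactly once. Even if the call's deadline expires, the
      // batch still completes (and a RECV_STATUS_ON_CLIENT op in it will report
      // GRPC_STATUS_DEADLINE_EXCEEDED), so this pluck returns without any cancel.
      grpc_event ev = grpc_completion_queue_pluck(
          cq, tag, gpr_inf_future(GPR_CLOCK_REALTIME), nullptr);
      GPR_ASSERT(ev.type == GRPC_OP_COMPLETE);

      // grpc_call_cancel_with_status() does NOT enqueue anything, so a second
      // pluck issued after it has no tag to wait for and blocks forever.
    }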
Also, you should not need to call cancel_with_status at all. If the call has timed out, that should mean the call has already finished and that a status is already available.

On Wed, Sep 20, 2017, 11:53 PM ryan via grpc.io <[email protected]> wrote:

> I just got it to reproduce. Here are the last ~3,000 lines of the debug
> output: https://gist.github.com/RyanGordon/e3d211a6430daac560296a4b301f121b
>
> At the end, it seemed like all of the requests timed out, so we called
> `grpc_call_cancel_with_status`
> <https://github.com/quizlet/grpc/blob/hhvm-bz2/src/php/ext/grpc/hhvm/call.cpp#L568>
> to clean the request up. After we call that, we call
> `grpc_completion_queue_pluck`
> <https://github.com/quizlet/grpc/blob/hhvm-bz2/src/php/ext/grpc/hhvm/call.cpp#L572>
> (justification: <https://github.com/grpc/grpc/pull/12306#issuecomment-330337644>)
> with an infinite timeout, so it gets stuck there forever. After about 4
> minutes of nothing happening, gRPC reports a "Socket closed" error, but
> nothing after that.
>
> The one line that sticks out to me is L2677 from the gist. Typically
> `grpc_call_combiner_stop()` gets called right after that line, but not in
> this case, and the behavior shown in the logs looks atypical after that.
>
> On Wednesday, September 20, 2017 at 12:57:35 PM UTC-7, Sree Kuchibhotla wrote:
>
>> Yes, GRPC_VERBOSITY=DEBUG.
>>
>> On Wed, Sep 20, 2017 at 12:10 PM, ryan via grpc.io <[email protected]> wrote:
>>
>>> 1) Yep, will do - I should continue to run with GRPC_VERBOSITY=DEBUG, right?
>>> 2) Just merged latest master through
>>>
>>> On Wednesday, September 20, 2017 at 9:32:54 AM UTC-7, Sree Kuchibhotla wrote:
>>>
>>>> The line does look a bit suspicious to me. (It means the combiner has
>>>> 30597 closures, which is a lot - the least significant bit in "last" is
>>>> ignored; the actual count is stored in the remaining bits.)
>>>>
>>>> D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
>>>>
>>>> Not sure why we ran into this situation. Putting that aside for now,
>>>> based on the logs (in the failed case), the threads are waiting for the
>>>> batch to complete:
>>>>
>>>> I0919 14:54:36.312170818 51030 call.c:2069] grpc_call_start_batch(call=0x7f4c22edb020, ops=0x7f550e39fde0, nops=4, tag=*0x7f4c22edb020*, reserved=(nil))
>>>> I0919 14:54:36.312199708 51030 call.c:1677] ops[0]: SEND_INITIAL_METADATA
>>>> ..
>>>> I0919 14:54:36.312210605 51030 call.c:1677] ops[1]: RECV_INITIAL_METADATA ptr=0x7f4c22e9e510
>>>> I0919 14:54:36.312214004 51030 call.c:1677] ops[2]: SEND_MESSAGE ptr=0x7f4c262df200
>>>> I0919 14:54:36.312216804 51030 call.c:1677] ops[3]: SEND_CLOSE_FROM_CLIENT
>>>> ..
>>>> D0919 14:54:36.312362846 51030 completion_queue.c:812] PENDING TAGS: *0x7f4c22edb020*
>>>>
>>>> The real reason to look here is why the call isn't making any progress
>>>> (I am assuming the sends are successful but the receives aren't coming).
>>>> In other words, I think the threads are waiting in epoll_wait and cv_wait
>>>> because there is nothing incoming on the fds.
>>>>
>>>> 1) Could you turn on the "tcp" traces as well? i.e.
>>>> export GRPC_TRACE="tcp,api,call_combiner,op_failure,client_channel"
>>>> 2) Could you sync to the latest HEAD? (We did fix a few bugs in timers
>>>> and the polling engine very recently; want to make sure you are on the
>>>> latest HEAD. Based on the line numbers in the callstacks, it looks like
>>>> your branch is a few days behind.)
>>>>
>>>> thanks,
>>>> Sree
>>>>
>>>> PS:
>>>> >> Hmm so it looks like the reason why grpc_combiner_continue_exec_ctx
>>>> >> doesn't get called is because exec_ctx->active_combiner is actually NULL
>>>> >> and so the function just returns false. Not yet sure why / how that became
>>>> >> NULL
>>>> This happens when we executed all the closures on the combiner, so this
>>>> may be a red herring (maybe not, but it would help to look at the tcp
>>>> traces first).
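As an aside on that decoding: under the scheme Sree describes (low bit ignored, remaining bits are the count), the two `last` values in these logs decode as follows. This is a standalone sketch of the arithmetic only, not gRPC source:

    #include <cstdint>
    #include <cstdio>

    int main() {
      // "last" values taken from the log lines in this thread:
      // 61995 is from the frozen request, 1 is from a healthy request.
      const uint64_t samples[] = {61995, 1};
      for (uint64_t last : samples) {
        bool low_bit = (last & 1) != 0;  // least-significant bit, ignored for the count
        uint64_t queued = last >> 1;     // remaining bits: closures queued on the combiner
        std::printf("last=%llu -> queued closures=%llu (low bit=%d)\n",
                    (unsigned long long)last, (unsigned long long)queued,
                    (int)low_bit);
      }
      return 0;
    }

Tens of thousands of closures queued on a single combiner, versus none in the healthy run, is consistent with the observation further down the thread that grpc_combiner_continue_exec_ctx never gets to run.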
>>>> On Tue, Sep 19, 2017 at 4:58 PM, ryan via grpc.io <[email protected]> wrote:
>>>>
>>>>> Hmm so it looks like the reason why grpc_combiner_continue_exec_ctx
>>>>> doesn't get called is because exec_ctx->active_combiner is actually NULL
>>>>> and so the function just returns false. Not yet sure why / how that
>>>>> became NULL.
>>>>>
>>>>> On Tuesday, September 19, 2017 at 9:55:10 AM UTC-7, [email protected] wrote:
>>>>>
>>>>>> Err... Actually I just realized that my source files had changed, so
>>>>>> that stack isn't entirely accurate. Here is a proper stack trace:
>>>>>>
>>>>>>> #0 0x00007f6ddb4a1a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>>>>>>> #1 0x00007f6dcbf4471c in gpr_cv_wait (cv=0x7f6db13f8a90, mu=0x7f6a65b9cd90, abs_deadline=...) at src/core/lib/support/sync_posix.c:80
>>>>>>> #2 0x00007f6dcbe9bea8 in begin_worker (pollset=0x7f6a65b9cd90, worker=0x7f6db13f8a70, worker_hdl=0x0, now=0x7f6db13f8a30, deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:747
>>>>>>> #3 0x00007f6dcbe9c7c9 in pollset_work (exec_ctx=0x7f6db13f8b80, ps=0x7f6a65b9cd90, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:941
>>>>>>> #4 0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6db13f8b80, pollset=0x7f6a65b9cd90, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
>>>>>>> #5 0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
>>>>>>> #6 0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
>>>>>>> #7 0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a9288a670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583
>>>>>>
>>>>>> and
>>>>>>
>>>>>>> #0 0x00007f6dd56b6d43 in epoll_wait () from /lib64/libc.so.6
>>>>>>> #1 0x00007f6dcbe9b9e5 in do_epoll_wait (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:640
>>>>>>> #2 0x00007f6dcbe9c972 in pollset_work (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:964
>>>>>>> #3 0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6d99ff8b80, pollset=0x7f6a65867710, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
>>>>>>> #4 0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
>>>>>>> #5 0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
>>>>>>> #6 0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a624e0670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583
>>>>>>
>>>>>> These are both proper examples of it just timing out on
>>>>>> epoll_wait/pthread_cond_timedwait.
>>>>>>
>>>>>> The other thing I've noticed is on the bad request:
>>>>>>
>>>>>>> D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 *last=61995*
>>>>>>
>>>>>> last is 61995, which seems to indicate the number of scheduled items
>>>>>> in the queue. In a normal request this number is kept low.
>>>>>>
>>>>>> On Tuesday, September 19, 2017 at 9:19:11 AM UTC-7, Michael Lumish wrote:
>>>>>>
>>>>>>> It looks like your stack trace includes a call to
>>>>>>> grpc_server_request_registered_call with a NULL server. That is
>>>>>>> suspicious.
>>>>>>>
>>>>>>> On Tue, Sep 19, 2017 at 8:41 AM ryan via grpc.io <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello all -
>>>>>>>>
>>>>>>>> I've been developing a gRPC-HHVM extension
>>>>>>>> <https://github.com/grpc/grpc/pull/11553>. In the process of stress
>>>>>>>> testing this extension we've consistently seen a condition where gRPC
>>>>>>>> gets into a state where any request that is made gets indefinitely
>>>>>>>> stuck on the epoll_wait syscall (or times out if a timeout is set on
>>>>>>>> grpc_completion_queue_next). Everything works fine and is chugging
>>>>>>>> along, and then this problem happens and all of the requests get
>>>>>>>> stuck or time out.
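For reference, the "times out if a timeout is set" path looks roughly like this against the core API (a minimal sketch with a hypothetical helper name, not code from the HHVM extension):

    #include <grpc/grpc.h>
    #include <grpc/support/time.h>

    // Hypothetical helper: wait up to timeout_secs for one batch's completion.
    // Returns true if the batch's tag arrived, false if the queue timed out
    // (the "stuck" symptom described above) or was shut down.
    bool wait_for_batch(grpc_completion_queue* cq, void* tag, int timeout_secs) {
      gpr_timespec deadline = gpr_time_add(
          gpr_now(GPR_CLOCK_MONOTONIC),
          gpr_time_from_seconds(timeout_secs, GPR_TIMESPAN));
      for (;;) {
        grpc_event ev = grpc_completion_queue_next(cq, deadline, nullptr);
        if (ev.type == GRPC_QUEUE_TIMEOUT) return false;   // nothing arrived in time
        if (ev.type == GRPC_QUEUE_SHUTDOWN) return false;  // queue shut down
        if (ev.type == GRPC_OP_COMPLETE && ev.tag == tag) return true;
        // A completion for some other tag: keep draining until ours shows up
        // or the deadline passes.
      }
    }

Either way the symptom is the same: the completion for the batch's tag simply never arrives.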
>>>>>>>> This is what it looks like in gdb:
>>>>>>>>
>>>>>>>> Thread 21 (Thread 0x7f54f6bff700 (LWP 51033)):
>>>>>>>> #0 0x00007f554ad90d43 in epoll_wait () from /lib64/libc.so.6 *<-- indefinitely stuck or times out if one is set*
>>>>>>>> #1 0x00007f55416969e5 in add_poll_object (exec_ctx=0x2482ad88, bag=0x188924, bag_type=(POLL_OBJ_POLLSET | unknown: 32596), item=0x0, item_type=(POLL_OBJ_POLLSET | unknown: 1097427428)) at src/core/lib/iomgr/ev_epollsig_linux.c:1484
>>>>>>>> #2 0x00007f5541697972 in fd_ref (fd=0x4, reason=0x33b8b87c0 <Address 0x33b8b87c0 out of bounds>, file=0x0, line=0) at src/core/lib/iomgr/ev_poll_posix.c:447
>>>>>>>> #3 0x00007f55416a4691 in tcp_drop_uncovered_then_handle_write (exec_ctx=0x7f54f6bf8680, arg=0x7f4c4e2c9010, error=0x0) at src/core/lib/iomgr/tcp_posix.c:239
>>>>>>>> #4 0x00007f55416c71a8 in grpc_server_request_registered_call (server=0x0, rmp=0x0, call=0x2482ad88, deadline=0x188924, initial_metadata=0x7f4c4e2c8f00, optional_payload=0x0, cq_bound_to_call=0x4, cq_for_notification=0x33b8b87c0, tag=0x0) at src/core/lib/surface/server.c:1486
>>>>>>>> #5 0x00007f55416c75d2 in conforms_to (slice=..., legal_bits=0x7f4c22ce2e50 "\270\064\245\214L\177", err_desc=0x7f54f6bf8701 "\211\030") at src/core/lib/surface/validate_metadata.c:40
>>>>>>>> #6 0x00007f5541b97105 in HPHP::c_Call_ni_startBatch (this_=0x7f4c22ce2e60, actions=...) at /home/ryangordon/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:529
>>>>>>>>
>>>>>>>> The stack trace doesn't always look like that, but it is always
>>>>>>>> stuck on epoll_wait/epoll_pwait, often from gpr_cv_wait.
>>>>>>>>
>>>>>>>> I've managed to get GRPC_TRACE=all GRPC_VERBOSITY=DEBUG output for a
>>>>>>>> good request and for a request when it is in the bad state, to
>>>>>>>> compare side by side.
>>>>>>>>
>>>>>>>> Good Request:
>>>>>>>> https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Not-Frozen%2520Log%2520Snippet
>>>>>>>>
>>>>>>>> Bad Request:
>>>>>>>> https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Frozen%2520log%2520snippet
>>>>>>>>
>>>>>>>> The bad request looks like this when its execution deviates:
>>>>>>>>
>>>>>>>> D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
>>>>>>>> D0919 14:54:36.312353011 51030 exec_ctx.c:73] closure 0x7f4c22edbec0 finished
>>>>>>>> I0919 14:54:36.312358626 51030 completion_queue.c:834] grpc_completion_queue_next(cq=0x7f4c22e9e200, deadline=gpr_timespec { tv_sec: 4, tv_nsec: 999000000, clock_type: 3 }, reserved=(nil))
>>>>>>>> ...
>>>>>>>> Whereas, the good request does this:
>>>>>>>>
>>>>>>>> D0915 15:19:50.178995777 13689 combiner.c:163] C:0x7f932172f700 grpc_combiner_execute c=0x7f9313d67fb0 last=1
>>>>>>>> D0915 15:19:50.179005880 13689 exec_ctx.c:73] closure 0x7f9313d67ec0 finished
>>>>>>>> D0915 15:19:50.179074826 13689 combiner.c:225] C:0x7f932172f700 grpc_combiner_continue_exec_ctx contended=0 exec_ctx_ready_to_finish=1 time_to_execute_final_list=0
>>>>>>>> D0915 15:19:50.179077765 13689 combiner.c:243] C:0x7f932172f700 maybe_finish_one n=0x7f9313d67fb0
>>>>>>>> ... (and a whole lot more)
>>>>>>>>
>>>>>>>> Effectively it seems like something is not getting scheduled when it
>>>>>>>> should be (grpc_combiner_continue_exec_ctx never gets run). I've been
>>>>>>>> working on this for a few weeks now and haven't been able to track
>>>>>>>> down why, or find a fix. I have the full logs if they might be
>>>>>>>> helpful. I've been testing on my branch, which is kept pretty close
>>>>>>>> to grpc/grpc master.
>>>>>>>>
>>>>>>>> Does anyone have any idea what is going on? I have the bad process
>>>>>>>> open on a test server in gdb, and I have timeouts implemented so I
>>>>>>>> can run repeated requests at it to continue debugging if necessary.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
