1) Yep, will do - I should continue to run with GRPC_VERBOSITY=DEBUG, right? 2) Just merged latest master through.
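
(Side note on the last=61995 value in your message below: if I'm reading src/core/lib/surface/combiner.c correctly, the low bit of that state word is just a flag and the closure count lives in the remaining bits, so the count works out to 61995 >> 1 = 30997. A tiny sanity-check sketch of that reading - my interpretation only, not verified against current HEAD:

    /* Sketch of my reading of the combiner "last" value (an assumption,
     * not taken verbatim from combiner.c): the low bit is a flag, the
     * remaining bits are the number of queued closures. */
    #include <inttypes.h>
    #include <stdio.h>

    int main(void) {
      intptr_t last = 61995;       /* value logged for the bad request */
      intptr_t queued = last >> 1; /* drop the flag bit: 61995 >> 1 == 30997 */
      printf("queued closures: %" PRIdPTR "\n", queued);
      return 0;
    }

So roughly ~31k closures backed up on that one combiner, if that reading is right.)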
On Wednesday, September 20, 2017 at 9:32:54 AM UTC-7, Sree Kuchibhotla wrote:

The line does look a bit suspicious to me (it means the combiner has 30997 closures, which is a lot - the least significant bit in "last" is ignored; the actual count is stored in the remaining bits):

D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995

Not sure why we ran into this situation. Putting that aside for now, based on the logs (in the failed case), the threads are waiting for the batch to complete:

I0919 14:54:36.312170818 51030 call.c:2069] grpc_call_start_batch(call=0x7f4c22edb020, ops=0x7f550e39fde0, nops=4, tag=*0x7f4c22edb020*, reserved=(nil))
I0919 14:54:36.312199708 51030 call.c:1677] ops[0]: SEND_INITIAL_METADATA
..
I0919 14:54:36.312210605 51030 call.c:1677] ops[1]: RECV_INITIAL_METADATA ptr=0x7f4c22e9e510
I0919 14:54:36.312214004 51030 call.c:1677] ops[2]: SEND_MESSAGE ptr=0x7f4c262df200
I0919 14:54:36.312216804 51030 call.c:1677] ops[3]: SEND_CLOSE_FROM_CLIENT
..
D0919 14:54:36.312362846 51030 completion_queue.c:812] PENDING TAGS: *0x7f4c22edb020*

The real reason to look here is why the call isn't making any progress (I am assuming the sends are successful but the receives aren't coming). In other words, I think the threads are waiting in epoll_wait and cv_wait because there is nothing incoming on the fds.

1) Could you turn on the "tcp" traces as well? i.e. export GRPC_TRACE="tcp,api,call_combiner,op_failure,client_channel"
2) Could you sync to the latest HEAD? (We fixed a few bugs in timers and the polling engine very recently, so I want to make sure you are on the latest HEAD. Based on the line numbers in the callstacks, it looks like your branch is a few days behind.)

thanks,
Sree

PS:
> Hmm so it looks like the reason why grpc_combiner_continue_exec_ctx doesn't get called is because exec_ctx->active_combiner is actually NULL and so the function just returns false. Not yet sure why / how that became NULL
This happens when we have executed all the closures on the combiner, so this may be a red herring (maybe not, but it would help to look at the tcp traces first).

On Tue, Sep 19, 2017 at 4:58 PM, ryan via grpc.io <[email protected]> wrote:

Hmm so it looks like the reason why grpc_combiner_continue_exec_ctx doesn't get called is because exec_ctx->active_combiner is actually NULL and so the function just returns false. Not yet sure why / how that became NULL.

On Tuesday, September 19, 2017 at 9:55:10 AM UTC-7, [email protected] wrote:

Err... Actually I just realized that my source files had changed, so that stack isn't entirely accurate. Here is a proper stack trace:

#0 0x00007f6ddb4a1a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007f6dcbf4471c in gpr_cv_wait (cv=0x7f6db13f8a90, mu=0x7f6a65b9cd90, abs_deadline=...) at src/core/lib/support/sync_posix.c:80
#2 0x00007f6dcbe9bea8 in begin_worker (pollset=0x7f6a65b9cd90, worker=0x7f6db13f8a70, worker_hdl=0x0, now=0x7f6db13f8a30, deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:747
#3 0x00007f6dcbe9c7c9 in pollset_work (exec_ctx=0x7f6db13f8b80, ps=0x7f6a65b9cd90, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:941
#4 0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6db13f8b80, pollset=0x7f6a65b9cd90, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
#5 0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
#6 0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
#7 0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a9288a670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583

and

#0 0x00007f6dd56b6d43 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f6dcbe9b9e5 in do_epoll_wait (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:640
#2 0x00007f6dcbe9c972 in pollset_work (exec_ctx=0x7f6d99ff8b80, ps=0x7f6a65867710, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_epoll1_linux.c:964
#3 0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6d99ff8b80, pollset=0x7f6a65867710, worker=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_posix.c:210
#4 0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:914
#5 0x00007f6dcbecc5d2 in grpc_completion_queue_next (cq=0x7f6a65867600, deadline=..., reserved=0x0) at src/core/lib/surface/completion_queue.c:992
#6 0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch (this_=0x7f6a624e0670, actions=...) at /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583

These are both proper examples of it just timing out on epoll_wait/pthread_cond_timedwait.

The other thing I've noticed is on the bad request:

D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 *last=61995*

last is 61995, which seems to indicate the number of scheduled items in the queue. In a normal request this number is kept low.

On Tuesday, September 19, 2017 at 9:19:11 AM UTC-7, Michael Lumish wrote:

It looks like your stack trace includes a call to grpc_server_request_registered_call with a NULL server. That is suspicious.

On Tue, Sep 19, 2017 at 8:41 AM ryan via grpc.io <[email protected]> wrote:

Hello all -

I've been developing a gRPC-HHVM extension <https://github.com/grpc/grpc/pull/11553>. In the process of stress testing this extension we've consistently seen a condition where gRPC gets into a state where any request that is made gets indefinitely stuck on the epoll_wait syscall (or times out if a timeout is set on grpc_completion_queue_next). Everything works fine and is chugging along, and then this problem happens and all of the requests get stuck or time out.
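
(The "timeout" here just means passing a finite deadline to grpc_completion_queue_next instead of an infinite one - roughly like the following, a simplified sketch rather than the exact call.cpp code:

    #include <stdint.h>
    #include <grpc/grpc.h>
    #include <grpc/support/time.h>

    /* Wait for the next completion event, but give up after max_seconds so
     * the thread cannot sit in epoll_wait forever (sketch only, not the
     * actual extension code). */
    static grpc_event next_with_timeout(grpc_completion_queue *cq,
                                        int64_t max_seconds) {
      gpr_timespec deadline = gpr_time_add(
          gpr_now(GPR_CLOCK_MONOTONIC),
          gpr_time_from_seconds(max_seconds, GPR_TIMESPAN));
      return grpc_completion_queue_next(cq, deadline, NULL);
    }

    /* An event of type GRPC_QUEUE_TIMEOUT from this call is the "stuck"
     * case described above. */

With an infinite deadline the thread just blocks in epoll_wait indefinitely instead.)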
This is what it looks like in gdb:

Thread 21 (Thread 0x7f54f6bff700 (LWP 51033)):
#0 0x00007f554ad90d43 in epoll_wait () from /lib64/libc.so.6 *<-- indefinitely stuck or times out if one is set*
#1 0x00007f55416969e5 in add_poll_object (exec_ctx=0x2482ad88, bag=0x188924, bag_type=(POLL_OBJ_POLLSET | unknown: 32596), item=0x0, item_type=(POLL_OBJ_POLLSET | unknown: 1097427428)) at src/core/lib/iomgr/ev_epollsig_linux.c:1484
#2 0x00007f5541697972 in fd_ref (fd=0x4, reason=0x33b8b87c0 <Address 0x33b8b87c0 out of bounds>, file=0x0, line=0) at src/core/lib/iomgr/ev_poll_posix.c:447
#3 0x00007f55416a4691 in tcp_drop_uncovered_then_handle_write (exec_ctx=0x7f54f6bf8680, arg=0x7f4c4e2c9010, error=0x0) at src/core/lib/iomgr/tcp_posix.c:239
#4 0x00007f55416c71a8 in grpc_server_request_registered_call (server=0x0, rmp=0x0, call=0x2482ad88, deadline=0x188924, initial_metadata=0x7f4c4e2c8f00, optional_payload=0x0, cq_bound_to_call=0x4, cq_for_notification=0x33b8b87c0, tag=0x0) at src/core/lib/surface/server.c:1486
#5 0x00007f55416c75d2 in conforms_to (slice=..., legal_bits=0x7f4c22ce2e50 "\270\064\245\214L\177", err_desc=0x7f54f6bf8701 "\211\030") at src/core/lib/surface/validate_metadata.c:40
#6 0x00007f5541b97105 in HPHP::c_Call_ni_startBatch (this_=0x7f4c22ce2e60, actions=...) at /home/ryangordon/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:529

The stack trace doesn't always look exactly like that, but it is always stuck on epoll_wait/epoll_pwait, often reached from gpr_cv_wait.

I've managed to get GRPC_TRACE=all GRPC_VERBOSITY=DEBUG logs of a good request and of a request when it is in the bad state, to compare side by side.

Good Request: https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Not-Frozen%2520Log%2520Snippet

Bad Request: https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Frozen%2520log%2520snippet

The bad request looks like this when its execution deviates:

D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
D0919 14:54:36.312353011 51030 exec_ctx.c:73] closure 0x7f4c22edbec0 finished
I0919 14:54:36.312358626 51030 completion_queue.c:834] grpc_completion_queue_next(cq=0x7f4c22e9e200, deadline=gpr_timespec { tv_sec: 4, tv_nsec: 999000000, clock_type: 3 }, reserved=(nil))
...

Whereas the good request does this:

D0915 15:19:50.178995777 13689 combiner.c:163] C:0x7f932172f700 grpc_combiner_execute c=0x7f9313d67fb0 last=1
D0915 15:19:50.179005880 13689 exec_ctx.c:73] closure 0x7f9313d67ec0 finished
D0915 15:19:50.179074826 13689 combiner.c:225] C:0x7f932172f700 grpc_combiner_continue_exec_ctx contended=0 exec_ctx_ready_to_finish=1 time_to_execute_final_list=0
D0915 15:19:50.179077765 13689 combiner.c:243] C:0x7f932172f700 maybe_finish_one n=0x7f9313d67fb0
...
(and a whole lot more)

Effectively it seems like something is not getting scheduled when it should be (grpc_combiner_continue_exec_ctx never gets run). I've been working on this for a few weeks now and haven't been able to track down why, or a fix. I have the full logs if they might be helpful. I've been testing on my branch, which is kept pretty close to grpc/grpc master.

Does anyone have any idea what is going on? I have the bad process open on a test server in gdb, and I have timeouts implemented so I can run repeated requests at it to continue debugging if necessary.

Thanks,
Ryan
