I just got it to reproduce. Here are the last ~3,000 lines of the debug 
output: https://gist.github.com/RyanGordon/e3d211a6430daac560296a4b301f121b

At the end, it seemed like all of the requests timed out, so we called
`grpc_call_cancel_with_status`
<https://github.com/quizlet/grpc/blob/hhvm-bz2/src/php/ext/grpc/hhvm/call.cpp#L568>
to clean the requests up. After we call that, we call
`grpc_completion_queue_pluck`
<https://github.com/quizlet/grpc/blob/hhvm-bz2/src/php/ext/grpc/hhvm/call.cpp#L572>
(justification: <https://github.com/grpc/grpc/pull/12306#issuecomment-330337644>)
with an infinite timeout, so it gets stuck there forever. After about 4
minutes of nothing happening, gRPC reports a "Socket closed" error, but
nothing after that.
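
For context, here is a minimal sketch of that cancel-then-pluck sequence
against the C core API (not the extension's exact code; `call`, `cq`, and
`tag` are placeholders):

    #include <grpc/grpc.h>
    #include <grpc/status.h>
    #include <grpc/support/time.h>

    void cancel_and_drain(grpc_call* call, grpc_completion_queue* cq,
                          void* tag) {
      // Ask the core to cancel the call; the completion tag for the
      // in-flight batch is still expected to be delivered on the queue.
      grpc_call_cancel_with_status(call, GRPC_STATUS_CANCELLED,
                                   "timed out", nullptr);

      // Block until that tag comes back. With gpr_inf_future() this never
      // times out, so if the core never completes the batch this is the
      // permanent hang we see.
      grpc_event ev = grpc_completion_queue_pluck(
          cq, tag, gpr_inf_future(GPR_CLOCK_REALTIME), nullptr);
      (void)ev;  // ev.type == GRPC_OP_COMPLETE once the batch finishes
    }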

The one line that sticks out to me is L2677 from the gist. Typically
`grpc_call_combiner_stop()` gets called right after that line, but not in
this case, and the behavior shown in the logs looks atypical after that.

On Wednesday, September 20, 2017 at 12:57:35 PM UTC-7, Sree Kuchibhotla 
wrote:
>
> Yes, GRPC_VERBOSITY=DEBUG.
>
> On Wed, Sep 20, 2017 at 12:10 PM, ryan via grpc.io <
> [email protected]> wrote:
>
>> 1) Yep, will do - I should continue to run with GRPC_VERBOSITY=DEBUG 
>> right?
>> 2) Just merged latest master through
>>
>>
>> On Wednesday, September 20, 2017 at 9:32:54 AM UTC-7, Sree Kuchibhotla 
>> wrote:
>>>
>>> The line does look a bit suspicious to me (it means the combiner
>>> has 30997 closures, which is a lot; the least significant bit in "last"
>>> is ignored, and the actual count is stored in the remaining bits):
>>>
>>> D0919 14:54:36.312349629   51030 combiner.c:163]             C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
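>>>
>>> (A minimal sketch of that decoding, under the assumption stated above
>>> that the low bit is a flag and the remaining bits are the closure count:
>>>
>>>     #include <cstdint>
>>>     #include <cstdio>
>>>
>>>     int main() {
>>>       uint64_t last = 61995;          // value from the log line above
>>>       uint64_t closures = last >> 1;  // drop the flag bit -> 30997
>>>       std::printf("queued closures: %llu\n",
>>>                   static_cast<unsigned long long>(closures));
>>>     }
>>>
>>> That works out to roughly 30997 queued closures on this one combiner.)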
>>>
>>> Not sure why we ran into this situation. Putting that aside for now,
>>> based on the logs (in the failed case), the threads are waiting for the
>>> batch to complete:
>>>
>>> I0919 14:54:36.312170818   51030 call.c:2069]               grpc_call_start_batch(call=0x7f4c22edb020, ops=0x7f550e39fde0, nops=4, tag=*0x7f4c22edb020*, reserved=(nil))
>>> I0919 14:54:36.312199708   51030 call.c:1677]                ops[0]: SEND_INITIAL_METADATA
>>> ..
>>> I0919 14:54:36.312210605   51030 call.c:1677]                ops[1]: RECV_INITIAL_METADATA ptr=0x7f4c22e9e510
>>> I0919 14:54:36.312214004   51030 call.c:1677]                ops[2]: SEND_MESSAGE ptr=0x7f4c262df200
>>> I0919 14:54:36.312216804   51030 call.c:1677]                ops[3]: SEND_CLOSE_FROM_CLIENT
>>> ..
>>> D0919 14:54:36.312362846   51030 completion_queue.c:812]     PENDING TAGS: *0x7f4c22edb020*
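>>>
>>> (For reference, a rough sketch of what starting a four-op batch like the
>>> one above looks like against the C core API; `call`, `request_payload`,
>>> `recv_md`, and `tag` are placeholder names, not the extension's actual
>>> variables:
>>>
>>>     #include <string.h>
>>>     #include <grpc/grpc.h>
>>>
>>>     // Sketch: start the same four-op batch the log shows.
>>>     grpc_call_error start_unary_batch(grpc_call* call,
>>>                                       grpc_byte_buffer* request_payload,
>>>                                       grpc_metadata_array* recv_md,
>>>                                       void* tag) {
>>>       grpc_op ops[4];
>>>       memset(ops, 0, sizeof(ops));
>>>       ops[0].op = GRPC_OP_SEND_INITIAL_METADATA;   // ops[0] in the log
>>>       ops[0].data.send_initial_metadata.count = 0;
>>>       ops[1].op = GRPC_OP_RECV_INITIAL_METADATA;   // ops[1]
>>>       ops[1].data.recv_initial_metadata.recv_initial_metadata = recv_md;
>>>       ops[2].op = GRPC_OP_SEND_MESSAGE;            // ops[2]
>>>       ops[2].data.send_message.send_message = request_payload;
>>>       ops[3].op = GRPC_OP_SEND_CLOSE_FROM_CLIENT;  // ops[3]
>>>       // All four ops complete as a single event tagged `tag` on the
>>>       // completion queue; that tag is what shows up under PENDING TAGS.
>>>       return grpc_call_start_batch(call, ops, 4, tag, NULL);
>>>     }
>>>
>>> The batch completes or fails as one unit, which is why a single stuck
>>> receive keeps the whole tag pending.)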
>>>
>>>
>>> The real question here is why the call isn't making any progress (I am
>>> assuming the sends are successful but the receives aren't coming in). In
>>> other words, I think the threads are waiting in epoll_wait and cv_wait
>>> because there is nothing incoming on the fds.
>>>
>>> 1) Could you turn on the "tcp" traces as well? i.e. export
>>> GRPC_TRACE="tcp,api,call_combiner,op_failure,client_channel"
>>> 2) Could you sync to the latest HEAD? (We fixed a few bugs in timers and
>>> the polling engine very recently; I want to make sure you are on the
>>> latest HEAD. Based on the line numbers in the call stacks, it looks like
>>> your branch is a few days behind.)
>>>
>>>
>>> thanks,
>>> Sree
>>>
>>> PS: 
>>> >> Hmm, so it looks like the reason grpc_combiner_continue_exec_ctx
>>> doesn't get called is that exec_ctx->active_combiner is actually NULL, so
>>> the function just returns false. Not yet sure why or how it became NULL.
>>> This happens when we have executed all the closures on the combiner, so
>>> this may be a red herring (maybe not, but it would help to look at the
>>> tcp traces first).
>>>
>>>
>>>
>>> On Tue, Sep 19, 2017 at 4:58 PM, ryan via grpc.io <
>>> [email protected]> wrote:
>>>
>>>>
>>>> Hmm, so it looks like the reason grpc_combiner_continue_exec_ctx
>>>> doesn't get called is that exec_ctx->active_combiner is actually NULL,
>>>> so the function just returns false. Not yet sure why or how it became
>>>> NULL.
>>>>
>>>>
>>>> On Tuesday, September 19, 2017 at 9:55:10 AM UTC-7, [email protected] 
>>>> wrote:
>>>>>
>>>>> Err... actually I just realized that my source files had changed, so
>>>>> that stack isn't entirely accurate. Here is a proper stack trace:
>>>>>
>>>>> #0  0x00007f6ddb4a1a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
>>>>>> /lib64/libpthread.so.0
>>>>>> #1  0x00007f6dcbf4471c in gpr_cv_wait (cv=0x7f6db13f8a90, 
>>>>>> mu=0x7f6a65b9cd90, abs_deadline=...) at 
>>>>>> src/core/lib/support/sync_posix.c:80
>>>>>> #2  0x00007f6dcbe9bea8 in begin_worker (pollset=0x7f6a65b9cd90, 
>>>>>> worker=0x7f6db13f8a70, worker_hdl=0x0, now=0x7f6db13f8a30, deadline=...) 
>>>>>> at 
>>>>>> src/core/lib/iomgr/ev_epoll1_linux.c:747
>>>>>> #3  0x00007f6dcbe9c7c9 in pollset_work (exec_ctx=0x7f6db13f8b80, 
>>>>>> ps=0x7f6a65b9cd90, worker_hdl=0x0, now=..., deadline=...) at 
>>>>>> src/core/lib/iomgr/ev_epoll1_linux.c:941
>>>>>> #4  0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6db13f8b80, 
>>>>>> pollset=0x7f6a65b9cd90, worker=0x0, now=..., deadline=...) at 
>>>>>> src/core/lib/iomgr/ev_posix.c:210
>>>>>> #5  0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65b9cc80, deadline=..., 
>>>>>> reserved=0x0) at src/core/lib/surface/completion_queue.c:914
>>>>>> #6  0x00007f6dcbecc5d2 in grpc_completion_queue_next 
>>>>>> (cq=0x7f6a65b9cc80, deadline=..., reserved=0x0) at 
>>>>>> src/core/lib/surface/completion_queue.c:992
>>>>>> #7  0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch 
>>>>>> (this_=0x7f6a9288a670, actions=...) at 
>>>>>> /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583
>>>>>
>>>>>
>>>>> and
>>>>>
>>>>> #0  0x00007f6dd56b6d43 in epoll_wait () from /lib64/libc.so.6
>>>>>> #1  0x00007f6dcbe9b9e5 in do_epoll_wait (exec_ctx=0x7f6d99ff8b80, 
>>>>>> ps=0x7f6a65867710, now=..., deadline=...) at 
>>>>>> src/core/lib/iomgr/ev_epoll1_linux.c:640
>>>>>> #2  0x00007f6dcbe9c972 in pollset_work (exec_ctx=0x7f6d99ff8b80, 
>>>>>> ps=0x7f6a65867710, worker_hdl=0x0, now=..., deadline=...) at 
>>>>>> src/core/lib/iomgr/ev_epoll1_linux.c:964
>>>>>> #3  0x00007f6dcbea9691 in grpc_pollset_work (exec_ctx=0x7f6d99ff8b80, 
>>>>>> pollset=0x7f6a65867710, worker=0x0, now=..., deadline=...) at 
>>>>>> src/core/lib/iomgr/ev_posix.c:210
>>>>>> #4  0x00007f6dcbecc1a8 in cq_next (cq=0x7f6a65867600, deadline=..., 
>>>>>> reserved=0x0) at src/core/lib/surface/completion_queue.c:914
>>>>>> #5  0x00007f6dcbecc5d2 in grpc_completion_queue_next 
>>>>>> (cq=0x7f6a65867600, deadline=..., reserved=0x0) at 
>>>>>> src/core/lib/surface/completion_queue.c:992
>>>>>> #6  0x00007f6dcc394ecc in HPHP::c_Call_ni_startBatch 
>>>>>> (this_=0x7f6a624e0670, actions=...) at 
>>>>>> /opt/build/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:583 
>>>>>
>>>>>
>>>>> These are both proper examples of it just timing out on 
>>>>> epoll_wait/pthread_cond_timedwait.
>>>>>
>>>>> The other thing I've noticed is on the bad request:
>>>>>
>>>>>>
>>>>>> D0919 14:54:36.312349629 51030 combiner.c:163] C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 *last=61995*
>>>>>
>>>>>
>>>>> last is 61995, which seems to indicate the number of scheduled items
>>>>> in the queue. In a normal request this number stays low.
>>>>>
>>>>> On Tuesday, September 19, 2017 at 9:19:11 AM UTC-7, Michael Lumish 
>>>>> wrote:
>>>>>>
>>>>>> It looks like your stack trace includes a call to 
>>>>>> grpc_server_request_registered_call with a NULL server. That is 
>>>>>> suspicious.
>>>>>>
>>>>>> On Tue, Sep 19, 2017 at 8:41 AM ryan via grpc.io <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hello all -
>>>>>>>
>>>>>>> I've been developing a gRPC-HHVM extension
>>>>>>> <https://github.com/grpc/grpc/pull/11553>. In the process of stress
>>>>>>> testing this extension we've consistently seen a condition where gRPC
>>>>>>> gets into a state where any request that is made gets indefinitely
>>>>>>> stuck on the epoll_wait syscall (or times out if a timeout is set on
>>>>>>> grpc_completion_queue_next). Everything works fine and is chugging
>>>>>>> along, and then this problem happens and all of the requests get
>>>>>>> stuck or time out. This is what it looks like in gdb:
>>>>>>>
>>>>>>> Thread 21 (Thread 0x7f54f6bff700 (LWP 51033)):
>>>>>>>
>>>>>>> #0  0x00007f554ad90d43 in epoll_wait () from /lib64/libc.so.6 *<-- 
>>>>>>> indefinitely stuck or times out if one is set*
>>>>>>>
>>>>>>> #1  0x00007f55416969e5 in add_poll_object (exec_ctx=0x2482ad88, 
>>>>>>> bag=0x188924, bag_type=(POLL_OBJ_POLLSET | unknown: 32596), item=0x0, 
>>>>>>> item_type=(POLL_OBJ_POLLSET | unknown: 1097427428)) at 
>>>>>>> src/core/lib/iomgr/ev_epollsig_linux.c:1484
>>>>>>>
>>>>>>> #2  0x00007f5541697972 in fd_ref (fd=0x4, reason=0x33b8b87c0 
>>>>>>> <Address 0x33b8b87c0 out of bounds>, file=0x0, line=0) at 
>>>>>>> src/core/lib/iomgr/ev_poll_posix.c:447
>>>>>>>
>>>>>>> #3  0x00007f55416a4691 in tcp_drop_uncovered_then_handle_write 
>>>>>>> (exec_ctx=0x7f54f6bf8680, arg=0x7f4c4e2c9010, error=0x0) at 
>>>>>>> src/core/lib/iomgr/tcp_posix.c:239
>>>>>>>
>>>>>>> #4  0x00007f55416c71a8 in grpc_server_request_registered_call 
>>>>>>> (server=0x0, rmp=0x0, call=0x2482ad88, deadline=0x188924, 
>>>>>>> initial_metadata=0x7f4c4e2c8f00, optional_payload=0x0, 
>>>>>>> cq_bound_to_call=0x4, cq_for_notification=0x33b8b87c0, 
>>>>>>>
>>>>>>>     tag=0x0) at src/core/lib/surface/server.c:1486
>>>>>>>
>>>>>>>
>>>>>>> #5  0x00007f55416c75d2 in conforms_to (slice=..., 
>>>>>>> legal_bits=0x7f4c22ce2e50 "\270\064\245\214L\177", 
>>>>>>> err_desc=0x7f54f6bf8701 
>>>>>>> "\211\030") at src/core/lib/surface/validate_metadata.c:40
>>>>>>>
>>>>>>> #6  0x00007f5541b97105 in HPHP::c_Call_ni_startBatch 
>>>>>>> (this_=0x7f4c22ce2e60, actions=...) at 
>>>>>>> /home/ryangordon/hhvm/grpc/src/php/ext/grpc/hhvm/call.cpp:529
>>>>>>>
>>>>>>>
>>>>>>> The stack trace doesn't always look like that, but it is always
>>>>>>> stuck on epoll_wait/epoll_pwait, often from gpr_cv_wait.
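>>>>>>>
>>>>>>> (The timeout variant mentioned above is just a finite deadline passed
>>>>>>> to grpc_completion_queue_next; a minimal sketch with a placeholder
>>>>>>> queue, not the extension's actual code:
>>>>>>>
>>>>>>>     #include <grpc/grpc.h>
>>>>>>>     #include <grpc/support/time.h>
>>>>>>>
>>>>>>>     // Poll with a 5-second deadline instead of blocking forever.
>>>>>>>     static bool wait_with_timeout(grpc_completion_queue* cq) {
>>>>>>>       gpr_timespec deadline = gpr_time_add(
>>>>>>>           gpr_now(GPR_CLOCK_MONOTONIC),
>>>>>>>           gpr_time_from_seconds(5, GPR_TIMESPAN));
>>>>>>>       grpc_event ev = grpc_completion_queue_next(cq, deadline, NULL);
>>>>>>>       // In the wedged state this is always GRPC_QUEUE_TIMEOUT.
>>>>>>>       return ev.type != GRPC_QUEUE_TIMEOUT;
>>>>>>>     }
>>>>>>>
>>>>>>> With an infinite deadline the same call simply never returns.)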
>>>>>>>
>>>>>>>
>>>>>>> I've managed to capture GRPC_TRACE=all GRPC_VERBOSITY=DEBUG logs of
>>>>>>> a good request and of a request when it is in the bad state, to
>>>>>>> compare them side by side.
>>>>>>>
>>>>>>>
>>>>>>> Good Request: 
>>>>>>> https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Not-Frozen%2520Log%2520Snippet
>>>>>>>
>>>>>>> Bad Request: 
>>>>>>> https://gist.githubusercontent.com/RyanGordon/831a8213e735bb1a4e5b93a854aab3c8/raw/7e37d595dbe56a77274bff83b64ab661a80e0c27/Frozen%2520log%2520snippet
>>>>>>>
>>>>>>>
>>>>>>> The bad request looks like this when its execution deviates:
>>>>>>>
>>>>>>>
>>>>>>> D0919 14:54:36.312349629   51030 combiner.c:163]             C:0x7f54faf35c00 grpc_combiner_execute c=0x7f4c22edbfb0 last=61995
>>>>>>> D0919 14:54:36.312353011   51030 exec_ctx.c:73]              closure 0x7f4c22edbec0 finished
>>>>>>> I0919 14:54:36.312358626   51030 completion_queue.c:834]     grpc_completion_queue_next(cq=0x7f4c22e9e200, deadline=gpr_timespec { tv_sec: 4, tv_nsec: 999000000, clock_type: 3 }, reserved=(nil))
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>>> Whereas, the good request does this:
>>>>>>>
>>>>>>> D0915 15:19:50.178995777   13689 combiner.c:163]             C:0x7f932172f700 grpc_combiner_execute c=0x7f9313d67fb0 last=1
>>>>>>> D0915 15:19:50.179005880   13689 exec_ctx.c:73]              closure 0x7f9313d67ec0 finished
>>>>>>> D0915 15:19:50.179074826   13689 combiner.c:225]             C:0x7f932172f700 grpc_combiner_continue_exec_ctx contended=0 exec_ctx_ready_to_finish=1 time_to_execute_final_list=0
>>>>>>> D0915 15:19:50.179077765   13689 combiner.c:243]             C:0x7f932172f700 maybe_finish_one n=0x7f9313d67fb0
>>>>>>>
>>>>>>> ... (and a whole lot more)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Effectively it seems like something is not getting scheduled when it
>>>>>>> should be (grpc_combiner_continue_exec_ctx never gets run). I've been
>>>>>>> working on this for a few weeks now and haven't been able to track
>>>>>>> down the cause or a fix. I have the full logs if they might be
>>>>>>> helpful. I've been testing on my branch, which is kept pretty close
>>>>>>> to grpc/grpc master.
>>>>>>>
>>>>>>>
>>>>>>> Does anyone have any idea what is going on? I have the bad process
>>>>>>> open on a test server in gdb and have timeouts implemented, so I can
>>>>>>> run repeated requests against it to continue debugging if necessary.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Ryan
>>>>>>>
