I have some time to revisit this problem today. Running with GRPC_VERBOSITY
set to DEBUG gives me the output below (anything starting with [2022-03-04T23...
is my own logging; how I enable the debug output is sketched right after the log):
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] Logger level is set at info.
[2022-03-04T23:15:41+10:00] [workflow_client] [warning] [process 501457,
thread 501457] Persistent not specified and will default to false
D0304 23:15:41.640009936 501457 ev_posix.cc:172] Using polling
engine: epollex
D0304 23:15:41.640079032 501457 lb_policy_registry.cc:42] registering LB
policy factory for "grpclb"
D0304 23:15:41.640106084 501457 lb_policy_registry.cc:42] registering LB
policy factory for "priority_experimental"
D0304 23:15:41.640124027 501457 lb_policy_registry.cc:42] registering LB
policy factory for "weighted_target_experimental"
D0304 23:15:41.640127329 501457 lb_policy_registry.cc:42] registering LB
policy factory for "pick_first"
D0304 23:15:41.640150128 501457 lb_policy_registry.cc:42] registering LB
policy factory for "round_robin"
D0304 23:15:41.640153574 501457 lb_policy_registry.cc:42] registering LB
policy factory for "ring_hash_experimental"
D0304 23:15:41.640162650 501457 dns_resolver_ares.cc:624] Using ares dns
resolver
D0304 23:15:41.640230087 501457 certificate_provider_registry.cc:33]
registering certificate provider factory for "file_watcher"
D0304 23:15:41.640236782 501457 lb_policy_registry.cc:42] registering LB
policy factory for "cds_experimental"
D0304 23:15:41.640241375 501457 lb_policy_registry.cc:42] registering LB
policy factory for "xds_cluster_impl_experimental"
D0304 23:15:41.640258747 501457 lb_policy_registry.cc:42] registering LB
policy factory for "xds_cluster_resolver_experimental"
D0304 23:15:41.640262652 501457 lb_policy_registry.cc:42] registering LB
policy factory for "xds_cluster_manager_experimental"
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] gRPC client at localhost:4500 is registering with the
server...
I0304 23:15:41.640909979 501458 subchannel.cc:945] subchannel
0x55e6713dcea0 {address=ipv6:[::1]:4500,
args=grpc.client_channel_factory=0x55e6713d6a20,
grpc.default_authority=localhost:4500,
grpc.internal.subchannel_pool=0x55e6713bb810,
grpc.primary_user_agent=grpc-c++/1.44.0-dev,
grpc.resource_quota=0x55e6713d7b40, grpc.server_uri=dns:///localhost:4500}:
connect failed: {"created":"@1646399741.640833261","description":"Failed to
connect to remote host: Connection
refused","errno":111,"file":"/home/hoover/workspace/DPAV/Linux/OpenSource/gRPC-focal/src/core/lib/iomgr/tcp_client_posix.cc","file_line":200,"os_error":"Connection
refused","syscall":"connect","target_address":"ipv6:[::1]:4500"}
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] ... gRPC client at localhost:4500 successfully registered
with server
I0304 23:15:41.642209150 501462 server_builder.cc:327] Synchronous
server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
I0304 23:15:41.642437546 501462 socket_utils_common_posix.cc:353]
TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] JobClient client constructed.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] HelloWorld client constructed.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] New id assigned:
localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
is trying to register client for job type: : hello_world
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501463] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
registered with Job Server 4501
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501457] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
received ACK for RegisterNewClient.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
identified JSON message as issueNewJob.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
accepted job.
[2022-03-04T23:15:41+10:00] [workflow_client] [info] [process 501457,
thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
acknowledged the job.
[2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457,
thread 501485] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
identified JSON message as jobProgressUpdateAck.
[2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457,
thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
received ACK for final JobProgressUpdate.
[2022-03-04T23:15:42+10:00] [workflow_client] [info] [process 501457,
thread 501484] Client localhost:4502:598ec6d9-8a75-4500-b34f-9cab3160c642
finished processing a job and now shutting down.
D0304 23:15:42.482888232 501484 chttp2_transport.cc:1715]
ipv4:127.0.0.1:56916: Sending goaway
err={"created":"@1646399742.482860742","description":"Server
shutdown","file":"/home/hoover/workspace/DPAV/Linux/OpenSource/gRPC-focal/src/core/lib/surface/server.cc","file_line":481,"grpc_status":0}
On Monday, February 14, 2022 at 1:37:12 PM UTC+10 Anh Tran wrote:
> Hi team,
>
> Yes, that is the case. I am also confused...
>
>
> On Saturday, February 12, 2022 at 9:00:29 AM UTC+10 [email protected]
> wrote:
>
>> Hi Anh,
>>
>> Thanks for the stack trace, but I'm a bit confused. Is this what you got
>> while your program was stuck waiting forever? It looks a bit off, because
>> Valgrind complains about "Conditional jump or move depends on uninitialised
>> value" and the call stack appears to show some deadlock checking inside.
>>
>> On Thursday, February 10, 2022 at 1:10:12 AM UTC-8 Anh Tran wrote:
>>
>>> Hi @google.com,
>>>
>>> Thanks for getting back to me at this. Please see my stack trace from
>>> valgrind:
>>>
>>> ==287419== Memcheck, a memory error detector
>>> ==287419== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et
>>> al.
>>> ==287419== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright
>>> info
>>> ==287419== Command:
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli --job-type
>>> slam
>>> ==287419== Parent PID: 287418
>>> ==287419==
>>> ==287419== Conditional jump or move depends on uninitialised value(s)
>>> ==287419== at 0x91743A: void** NextStackFrame<true, false>(void**,
>>> void const*, unsigned long, unsigned long) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x916F98: int UnwindImpl<false, false>(void**, int*,
>>> int, int, void const*, int*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x916D8D: absl::lts_20211102::GetStackTrace(void**,
>>> int, int) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90F27C: absl::lts_20211102::(anonymous
>>> namespace)::GetStack(void**, int) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x924569:
>>> absl::lts_20211102::synchronization_internal::GraphCycles::UpdateStackTrace(absl::lts_20211102::synchronization_internal::GraphId,
>>>
>>> int, int (*)(void**, int)) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90F342:
>>> absl::lts_20211102::DeadlockCheck(absl::lts_20211102::Mutex*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90F84C:
>>> absl::lts_20211102::DebugOnlyDeadlockCheck(absl::lts_20211102::Mutex*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90FB6E: absl::lts_20211102::Mutex::Lock() (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x8A3AD0: gpr_mu_lock (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x7DFA25: begin_worker(grpc_pollset*,
>>> grpc_pollset_worker*, grpc_pollset_worker**, long) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x7DFFDA: pollset_work(grpc_pollset*,
>>> grpc_pollset_worker**, long) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x7716F4: pollset_work(grpc_pollset*,
>>> grpc_pollset_worker**, long) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== Uninitialised value was created by a stack allocation
>>> ==287419== at 0x1C3FD0:
>>> Emesent::GRPCQueue::Stub::Register(grpc::ClientContext*, Emesent::Empty
>>> const&, Emesent::RegistrationResponse*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419==
>>> ==287419== Thread 7 grpcpp_sync_ser:
>>> ==287419== Conditional jump or move depends on uninitialised value(s)
>>> ==287419== at 0x916F7F: int UnwindImpl<false, false>(void**, int*,
>>> int, int, void const*, int*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x916D8D: absl::lts_20211102::GetStackTrace(void**,
>>> int, int) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90F27C: absl::lts_20211102::(anonymous
>>> namespace)::GetStack(void**, int) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x924569:
>>> absl::lts_20211102::synchronization_internal::GraphCycles::UpdateStackTrace(absl::lts_20211102::synchronization_internal::GraphId,
>>>
>>> int, int (*)(void**, int)) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90F342:
>>> absl::lts_20211102::DeadlockCheck(absl::lts_20211102::Mutex*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90F84C:
>>> absl::lts_20211102::DebugOnlyDeadlockCheck(absl::lts_20211102::Mutex*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x90FB6E: absl::lts_20211102::Mutex::Lock() (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x8A3AD0: gpr_mu_lock (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x5121A4: grpc_cq_begin_op(grpc_completion_queue*,
>>> void*) (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x5199C4:
>>> grpc_core::Server::ShutdownAndNotify(grpc_completion_queue*, void*) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x51CD63: grpc_server_shutdown_and_notify (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== by 0x231AA2: grpc::Server::ShutdownInternal(gpr_timespec)
>>> (in /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419== Uninitialised value was created by a stack allocation
>>> ==287419== at 0x1D37E0: std::_Function_handler<void
>>> (std::__cxx11::basic_string<char, std::char_traits<char>,
>>> std::allocator<char> > const&, std::__cxx11::basic_string<char,
>>> std::char_traits<char>, std::allocator<char> > const&),
>>> workflow_client::GRPCIcomm::start_and_wait()::{lambda(std::__cxx11::basic_string<char,
>>>
>>> std::char_traits<char>, std::allocator<char> > const&,
>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>> std::allocator<char> > const&)#1}>::_M_invoke(std::_Any_data const&,
>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>> std::allocator<char> > const&, std::__cxx11::basic_string<char,
>>> std::char_traits<char>, std::allocator<char> > const&) (in
>>> /home/atran/codes/workflow_client/dist/workflow/job_client_cli)
>>> ==287419==
>>>
>>> In case this is interesting, this is how I start my server and wait
>>> until shutdown:
>>>
>>> std::string server_addr_and_port = base_address_ + ":" + id_;
>>> grpc::ServerBuilder server_builder;
>>> server_builder.AddListeningPort(server_addr_and_port,
>>>                                 grpc::InsecureServerCredentials());
>>>
>>> GRPCQueueImpl grpc_queue_service;
>>> server_builder.RegisterService(&grpc_queue_service);
>>> server_ = server_builder.BuildAndStart();
>>> server_->Wait();
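>>>
>>> (The Shutdown call itself is made from a different thread than the one
>>> blocked in Wait(). A simplified, hypothetical sketch of that side; the
>>> request_stop name and the 1-second deadline are placeholders:)
>>>
>>> // Called from another thread (e.g. a control RPC or signal handler).
>>> void GRPCIcomm::request_stop() {
>>>   // Give in-flight synchronous calls up to 1 second to finish; after the
>>>   // deadline gRPC cancels them and Wait() on the other thread returns.
>>>   server_->Shutdown(std::chrono::system_clock::now() +
>>>                     std::chrono::seconds(1));
>>> }
>>>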
>>> On Thursday, February 10, 2022 at 9:08:39 AM UTC+10 [email protected]
>>> wrote:
>>>
>>>> I have also experienced the same issue in my work environment. Our
>>>> workaround was to use an N-second timeout, check a shutdown flag, and
>>>> then call Shutdown.
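>>>>
>>>> Roughly, that pattern looks like this (a hypothetical sketch with made-up
>>>> names, not the actual code):
>>>>
>>>> #include <atomic>
>>>> #include <chrono>
>>>> #include <thread>
>>>> #include <grpcpp/grpcpp.h>
>>>>
>>>> // Sketch: block in Wait() on a helper thread, poll a stop flag every
>>>> // N seconds, and once it is set, shut down with a short deadline.
>>>> void run_until_stopped(grpc::Server& server,
>>>>                        std::atomic<bool>& stop_flag, int n_seconds) {
>>>>   std::thread waiter([&] { server.Wait(); });
>>>>   while (!stop_flag.load()) {
>>>>     std::this_thread::sleep_for(std::chrono::seconds(n_seconds));
>>>>   }
>>>>   // Give in-flight calls 1 second before they are cancelled.
>>>>   server.Shutdown(std::chrono::system_clock::now() +
>>>>                   std::chrono::seconds(1));
>>>>   waiter.join();
>>>> }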
>>>>
>>>>
>>>>
>>>> On Feb 9, 2022 at 3:54 PM, <'[email protected]' via grpc.io> wrote:
>>>>
>>>> Would you capture the call-stack when Shutdown function appears to hang?
>>>>
>>>> On Wednesday, February 2, 2022 at 10:24:14 PM UTC-8 Anh Tran wrote:
>>>>
>>>>> The last commit on my local copy of gRPC is from 06 Jan 2022, so it's
>>>>> not that old.
>>>>> On Thursday, February 3, 2022 at 4:51:16 AM UTC+10 Easwar Swaminathan
>>>>> wrote:
>>>>>
>>>>>> Have any of y'all seen this in a more recent version of gRPC?
>>>>>> `1.30.2` is really old.
>>>>>>
>>>>>> On Tuesday, February 1, 2022 at 12:15:54 AM UTC-8 Prasanta Behera
>>>>>> wrote:
>>>>>>
>>>>>>> No, I've not found any solution yet! I was trying to put together a
>>>>>>> hello-world kind of example to reproduce it so I could open a ticket,
>>>>>>> but as you said, with such simple examples it never reproduces...
>>>>>>>
>>>>>>> On Tuesday, February 1, 2022 at 1:42:43 PM UTC+5:30 Anh Tran wrote:
>>>>>>>
>>>>>>>> I should also add that this behavior happens when multiple threads
>>>>>>>> are involved.
>>>>>>>>
>>>>>>>> On Tuesday, February 1, 2022 at 3:45:04 PM UTC+10 Anh Tran wrote:
>>>>>>>>
>>>>>>>>> Unfortunately I came across the same problem.
>>>>>>>>> When I try to isolate the problem (with a simple hello-world
>>>>>>>>> server-client), it seems to go away, i.e. the server can be shut
>>>>>>>>> down.
>>>>>>>>>
>>>>>>>>> But when used in my own project, the hang happens. And despite my
>>>>>>>>> efforts to track down the problem, I can't seem to spot the cause.
>>>>>>>>> This is weird, since this version of Shutdown should force a
>>>>>>>>> shutdown after a certain period, as described in the doc.
>>>>>>>>>
>>>>>>>>> Prasanta, did you happen to find a solution to your problem?
>>>>>>>>> Maybe it will give a hint for mine.
>>>>>>>>>
>>>>>>>>> On Thursday, January 20, 2022 at 10:02:43 PM UTC+10 Prasanta
>>>>>>>>> Behera wrote:
>>>>>>>>>
>>>>>>>>>> gRPC version : 1.30.2 (c++)
>>>>>>>>>> $ gcc --version
>>>>>>>>>> gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
>>>>>>>>>>
>>>>>>>>>> We use gRPC in synchronous mode and stop our gRPC services with
>>>>>>>>>> this code:
>>>>>>>>>>
>>>>>>>>>> m_server->Shutdown(std::chrono::system_clock::now() +
>>>>>>>>>>                    std::chrono::seconds(1));
>>>>>>>>>> m_server->Wait();
>>>>>>>>>>
>>>>>>>>>> On rare occasions (about once in 1000 tries) the Shutdown() call
>>>>>>>>>> does NOT return and the process just hangs.
>>>>>>>>>>
>>>>>>>>>> Is there anything I'm missing to get a reliable shutdown of the
>>>>>>>>>> server?
>>>>>>>>>>