Wenzhe Zhou created IMPALA-13399:
------------------------------------

             Summary: Thrift RPCs for Statestore heartbeat timed out randomly 
in UBSAN build 
                 Key: IMPALA-13399
                 URL: https://issues.apache.org/jira/browse/IMPALA-13399
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Wenzhe Zhou


This issue was seen in IMPALA-13388.
In the UBSAN build, Thrift RPCs for Statestore heartbeats randomly timed out 
in the Statestored HA unit tests, and heartbeats were not delivered to 
subscribers. See the log messages in the statestored log file:
{code:java}
I0917 00:06:54.902865 3728987 statestore.cc:1414] Unable to send heartbeat 
message to subscriber 
catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000,
 received error: RPC recv timed out: dest address: 
impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23020, rpc: 
N6impala18THeartbeatResponseE
Subscriber 
catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000
 timed-out during heartbeat RPC. Timeout is 3s.
I0917 00:06:54.902873 3728987 failure-detector.cc:91] 4 consecutive heartbeats 
failed for 
'catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000'.
 State is OK
I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest 
address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, 
rpc: N6impala18THeartbeatResponseE
I0917 00:06:55.382797 3728993 client-cache.cc:174] Broken Connection, destroy 
client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001
I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat 
message to subscriber 
impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001, 
received error: RPC recv timed out: dest address: 
impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, rpc: 
N6impala18THeartbeatResponseE
Subscriber 
impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001 
timed-out during heartbeat RPC. Timeout is 3s.
I0917 00:06:55.382876 3728993 failure-detector.cc:91] 4 consecutive heartbeats 
failed for 
'impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001'. 
State is OK
I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest 
address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000, 
rpc: N6impala18THeartbeatResponseE
I0917 00:06:56.032806 3728990 client-cache.h:372] RPC recv timed out: dest 
address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23002, 
rpc: N6impala18THeartbeatResponseE
I0917 00:06:56.032831 3728989 client-cache.cc:174] Broken Connection, destroy 
client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000
{code}
Subscriber (impalad or catalogd) log files showed that subscribers did not 
receive heartbeats after successful registration, so they tried to 
re-register with the statestores. The failure is one-directional: only Thrift 
RPCs from statestored to subscribers are affected. Something appears to be 
wrong in the Thrift RPC layer. 
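
To see which subscriber endpoints are affected and how often, the timeouts in 
the statestored log can be tallied per destination address. The following is 
only a rough triage sketch, not part of Impala: the function name and the 
sample host "host.example.com" are hypothetical, and the regular expression 
assumes the message format shown in the log excerpt above. Only the 
client-cache.h lines are counted, since statestore.cc repeats the same error 
text for each failed heartbeat.

```python
import re
from collections import Counter

# Assumed message format, taken from the client-cache.h lines in the excerpt.
TIMEOUT_RE = re.compile(
    r"client-cache\.h:\d+\] RPC recv timed out: dest address: (\S+?), "
    r"rpc: \S*THeartbeatResponse"
)


def count_heartbeat_timeouts(log_text):
    """Return {dest_address: timeout_count} for heartbeat RPC timeouts."""
    # Collapse all whitespace so that log lines wrapped by a mail client
    # (as in the excerpt above) still match as a single message.
    flat = " ".join(log_text.split())
    return dict(Counter(TIMEOUT_RE.findall(flat)))


# Hypothetical sample input; the host name is a placeholder.
SAMPLE = """\
I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest
address: host.example.com:23001,
rpc: N6impala18THeartbeatResponseE
I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat
message to subscriber sub@host.example.com:27001, received error: RPC recv
timed out: dest address: host.example.com:23001, rpc:
N6impala18THeartbeatResponseE
I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest
address: host.example.com:23000, rpc: N6impala18THeartbeatResponseE
"""

if __name__ == "__main__":
    for dest, n in sorted(count_heartbeat_timeouts(SAMPLE).items()):
        print(dest, n)
```

Running the same function over the full statestored log of a failed run would 
show whether the timeouts are spread evenly across subscribers or concentrated 
on particular ports.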

Increasing the value of the flag 
"statestore_heartbeat_tcp_timeout_seconds" from 3 seconds to 6 seconds did 
not help; the issue still occurred.

This issue did not happen in other build types, such as the regular build and 
the ASAN build. Statestored HA unit tests are not run for the TSAN build.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
