Wenzhe Zhou created IMPALA-13399:
------------------------------------
Summary: Thrift RPCs for Statestore heartbeat timed out randomly
in UBSAN build
Key: IMPALA-13399
URL: https://issues.apache.org/jira/browse/IMPALA-13399
Project: IMPALA
Issue Type: Bug
Components: Backend
Reporter: Wenzhe Zhou
This issue was seen in IMPALA-13388.
In the UBSAN build, Thrift RPCs for Statestore heartbeats randomly timed out in
the Statestored HA unit tests, and heartbeats were not delivered to
subscribers. See the following messages from the statestored log file:
{code:java}
I0917 00:06:54.902865 3728987 statestore.cc:1414] Unable to send heartbeat
message to subscriber
catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000,
received error: RPC recv timed out: dest address:
impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23020, rpc:
N6impala18THeartbeatResponseE
Subscriber
catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000
timed-out during heartbeat RPC. Timeout is 3s.
I0917 00:06:54.902873 3728987 failure-detector.cc:91] 4 consecutive heartbeats
failed for
'catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000'.
State is OK
I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001,
rpc: N6impala18THeartbeatResponseE
I0917 00:06:55.382797 3728993 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001
I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat
message to subscriber
impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001,
received error: RPC recv timed out: dest address:
impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, rpc:
N6impala18THeartbeatResponseE
Subscriber
impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001
timed-out during heartbeat RPC. Timeout is 3s.
I0917 00:06:55.382876 3728993 failure-detector.cc:91] 4 consecutive heartbeats
failed for
'impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001'.
State is OK
I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000,
rpc: N6impala18THeartbeatResponseE
I0917 00:06:56.032806 3728990 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23002,
rpc: N6impala18THeartbeatResponseE
I0917 00:06:56.032831 3728989 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000
{code}
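To see which subscriber ports are affected, the timeouts in a log like the one above can be tallied per destination port. The following is only a minimal sketch: the function name count_timeouts_by_port and the hostname host.example.com are hypothetical, and the pattern assumes the message format shown in the excerpt. A single failed RPC may be logged by both client-cache.h and statestore.cc, so the counts are message occurrences, not distinct RPCs.

```python
import re
from collections import Counter

# Matches the timeout message seen in the statestored log. \s+ tolerates
# line wrapping between "dest" and "address:" (as in the mailed excerpt).
TIMEOUT_RE = re.compile(r"RPC recv timed out: dest\s+address:\s+(\S+?):(\d+),")


def count_timeouts_by_port(log_text: str) -> Counter:
    """Count 'RPC recv timed out' messages per destination port.

    Hypothetical helper for triage; counts occurrences of the message,
    so one failed RPC logged in two places is counted twice.
    """
    return Counter(int(m.group(2)) for m in TIMEOUT_RE.finditer(log_text))


# Sample input modeled on the excerpt; the hostname is a placeholder.
sample = """\
I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest
address: host.example.com:23001,
rpc: N6impala18THeartbeatResponseE
I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest
address: host.example.com:23000,
rpc: N6impala18THeartbeatResponseE
"""

if __name__ == "__main__":
    for port, n in sorted(count_timeouts_by_port(sample).items()):
        print(f"port {port}: {n} timeout message(s)")
```

Running the same function over the full statestored log would show whether the timeouts are spread across all subscribers or concentrated on a few.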
The subscriber (impalad or catalogd) log files showed that subscribers did not
receive heartbeats after registering successfully, so they tried to
re-register with the statestores. The problem is one-directional: only Thrift
RPCs from statestored to subscribers are affected. It seems that something is
wrong in the Thrift RPC layer.
Increasing the value of the flag
"statestore_heartbeat_tcp_timeout_seconds" from 3 seconds to 6 seconds did not
help; the issue still happened.
This issue did not happen in other build types, such as the regular build and
the ASAN build. Statestored HA unit tests are not run for the TSAN build.