[ https://issues.apache.org/jira/browse/IMPALA-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenzhe Zhou resolved IMPALA-13399.
----------------------------------
Fix Version/s: Impala 4.5.0
Resolution: Fixed
> Thrift RPCs for Statestore heartbeat timed out randomly in UBSAN build
> -----------------------------------------------------------------------
>
> Key: IMPALA-13399
> URL: https://issues.apache.org/jira/browse/IMPALA-13399
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Wenzhe Zhou
> Priority: Major
> Labels: flaky-test
> Fix For: Impala 4.5.0
>
>
> This issue was seen in IMPALA-13388.
> In the UBSAN build, Thrift RPCs for Statestore heartbeats randomly timed out
> in Statestored HA unit tests, and heartbeats were not sent to
> subscribers. See the log messages in the statestored log file:
> {code:java}
> I0917 00:06:54.902865 3728987 statestore.cc:1414] Unable to send heartbeat
> message to subscriber
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000,
> received error: RPC recv timed out: dest address:
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23020, rpc:
> N6impala18THeartbeatResponseE
> Subscriber
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000
> timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:54.902873 3728987 failure-detector.cc:91] 4 consecutive
> heartbeats failed for
> 'catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000'.
> State is OK
> I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001,
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:55.382797 3728993 client-cache.cc:174] Broken Connection, destroy
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001
> I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat
> message to subscriber
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001,
> received error: RPC recv timed out: dest address:
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, rpc:
> N6impala18THeartbeatResponseE
> Subscriber
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001
> timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:55.382876 3728993 failure-detector.cc:91] 4 consecutive
> heartbeats failed for
> 'impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001'.
> State is OK
> I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000,
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032806 3728990 client-cache.h:372] RPC recv timed out: dest
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23002,
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032831 3728989 client-cache.cc:174] Broken Connection, destroy
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000
> {code}
> Subscriber (impalad or catalogd) log files showed that subscribers did not
> receive heartbeats after registering successfully, so they tried to
> re-register with the statestored. This is a one-directional Thrift RPC issue,
> from statestored to subscribers. Something appears to be wrong in the Thrift
> RPC layer.
> Increasing the flag "statestore_heartbeat_tcp_timeout_seconds" from 3 seconds
> to 6 seconds did not help; the issue still happened.
> This issue did not happen in other build types, such as the regular build and
> the ASAN build. Statestored HA unit tests are not run for the TSAN build.
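
To see which subscriber ports are affected and how often, the timeouts in a statestored log can be tallied per destination address. The following is a minimal sketch, not part of the fix: the function name count_heartbeat_timeouts is hypothetical, and it assumes each log entry sits on a single line in the real log file (the excerpt above is wrapped by the mail client). It counts only the client-cache.h entries so that the matching statestore.cc "Unable to send heartbeat" entry for the same failure is not counted twice.

```python
import re
from collections import Counter

# Matches the client-cache.h "RPC recv timed out" entries for heartbeat RPCs,
# capturing the destination address (host:port). The pattern is derived from
# the log excerpt in this issue and is an assumption about the log format.
TIMEOUT_RE = re.compile(
    r"client-cache\.h:\d+\] RPC recv timed out: dest address: "
    r"([^\s,]+), rpc: \S*THeartbeatResponse"
)


def count_heartbeat_timeouts(log_text: str) -> Counter:
    """Return the number of heartbeat RPC recv timeouts per destination address."""
    return Counter(TIMEOUT_RE.findall(log_text))


if __name__ == "__main__":
    import sys

    # Usage: python count_timeouts.py /path/to/statestored.INFO
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        for dest, n in count_heartbeat_timeouts(f.read()).most_common():
            print(f"{n:6d}  {dest}")
```

Running this against the statestored logs of a failing UBSAN run and of a passing regular-build run would show whether the timeouts cluster on particular subscribers or are spread evenly.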