[ 
https://issues.apache.org/jira/browse/IMPALA-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou resolved IMPALA-13399.
----------------------------------
    Fix Version/s: Impala 4.5.0
       Resolution: Fixed

> Thrift RPCs for Statestore heartbeat timed out randomly in UBSAN build 
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-13399
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13399
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Priority: Major
>              Labels: flaky-test
>             Fix For: Impala 4.5.0
>
>
> This issue was seen in IMPALA-13388.
> In the UBSAN build, Thrift RPCs for Statestore heartbeats randomly timed out 
> in Statestored HA unit tests, and heartbeats were not delivered to 
> subscribers. See the log messages from the statestored log file:
> {code:java}
> I0917 00:06:54.902865 3728987 statestore.cc:1414] Unable to send heartbeat 
> message to subscriber 
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000,
>  received error: RPC recv timed out: dest address: 
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23020, rpc: 
> N6impala18THeartbeatResponseE
> Subscriber 
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000
>  timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:54.902873 3728987 failure-detector.cc:91] 4 consecutive 
> heartbeats failed for 
> 'catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000'.
>  State is OK
> I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest 
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, 
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:55.382797 3728993 client-cache.cc:174] Broken Connection, destroy 
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001
> I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat 
> message to subscriber 
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001, 
> received error: RPC recv timed out: dest address: 
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, rpc: 
> N6impala18THeartbeatResponseE
> Subscriber 
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001 
> timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:55.382876 3728993 failure-detector.cc:91] 4 consecutive 
> heartbeats failed for 
> 'impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001'. 
> State is OK
> I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest 
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000, 
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032806 3728990 client-cache.h:372] RPC recv timed out: dest 
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23002, 
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032831 3728989 client-cache.cc:174] Broken Connection, destroy 
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000
> {code}
> Subscriber (impalad or catalogd) log files showed that subscribers did not 
> receive heartbeats after successful registration, so they tried to 
> re-register with the statestored. The failure is one-directional: only Thrift 
> RPCs from statestored to subscribers are affected, which suggests a problem 
> in the Thrift RPC layer. Increasing the flag 
> "statestore_heartbeat_tcp_timeout_seconds" from 3 seconds to 6 seconds did 
> not resolve the issue.
> The issue did not occur in other build types, such as the regular build and 
> the ASAN build. Statestored HA unit tests are not run for the TSAN build.
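
For triage, the timeouts in a statestored log excerpt like the one quoted above can be tallied per destination port. The following is a minimal Python sketch, not part of Impala: the function name `count_heartbeat_timeouts` and the regular expression are assumptions based only on the log format shown in this message. It counts only the `client-cache.h` lines, so the matching `statestore.cc` lines are not double-counted.

```python
import re
from collections import Counter

# Matches the client-cache.h timeout message seen in the log excerpt.
# \s+ between tokens tolerates whitespace left over from email line wrapping.
TIMEOUT_RE = re.compile(
    r"client-cache\.h:\d+\]\s+RPC\s+recv\s+timed\s+out:"
    r"\s+dest\s+address:\s+(\S+?):(\d+),"
)


def count_heartbeat_timeouts(log_text):
    """Return a Counter mapping destination port -> number of timed-out RPCs."""
    # Drop email quote markers, then collapse whitespace so wrapped lines rejoin.
    unquoted = re.sub(r"^>\s?", "", log_text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    return Counter(int(port) for _host, port in TIMEOUT_RE.findall(flat))
```

Calling `count_heartbeat_timeouts(open(path).read())` on a saved log file (where `path` is a hypothetical local file name) returns the per-port counts, which makes it easier to see whether the timeouts cluster on particular subscribers.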



--
This message was sent by Atlassian Jira
(v8.20.10#820010)