[jira] [Commented] (IMPALA-13399) Thrift RPCs for Statestore heartbeat timed out randomly in UBSAN build

ASF subversion and git services (Jira) Mon, 23 Sep 2024 23:03:08 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884121#comment-17884121
 ]


ASF subversion and git services commented on IMPALA-13399:
----------------------------------------------------------

Commit d2cd9b51a03dbd8b2e485ee446bf7530656ab214 in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d2cd9b51a ]

IMPALA-13388: fix unit-tests of Statestore HA for UBSAN builds

In UBSAN builds, the Statestore HA unit tests sometimes failed because
Thrift RPCs timed out on receive. The standby statestored failed to send
heartbeats to its subscribers, so failover was not triggered.
The Thrift RPC failures persisted even after increasing the TCP timeout
for Thrift RPCs between statestored and its subscribers.

This patch adds a metric for the number of subscribers that received
heartbeats from statestored in a monitoring period. In UBSAN builds,
the Statestored HA unit tests are skipped if statestored failed to
send heartbeats to more than half of the subscribers.
In other builds, the tests throw an exception whose error message
reports the Thrift RPC failure if statestored failed to send
heartbeats to more than half of the subscribers.
Also fixed a bug where the value returned by SecondsSinceHeartbeat()
was compared with a time value in milliseconds.
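
The two changes described above can be sketched roughly as follows. This
is an illustrative sketch only, not the code from the patch: the names
HeartbeatSample, HeartbeatRpcError, classify_heartbeat_coverage and
heartbeat_is_stale are hypothetical, and the real metric name and test
helpers in Impala may differ. It assumes "more than half" is a strict
comparison and that the timeout is expressed in milliseconds.

```python
from dataclasses import dataclass


class HeartbeatRpcError(Exception):
    """Hypothetical error for non-UBSAN builds when heartbeats mostly failed."""


@dataclass
class HeartbeatSample:
    """Hypothetical snapshot of one monitoring period."""
    total_subscribers: int
    subscribers_received_heartbeat: int  # assumed value of the new metric


def classify_heartbeat_coverage(sample, is_ubsan_build):
    """Return "ok" or "skip"; raise HeartbeatRpcError for other builds."""
    failed = sample.total_subscribers - sample.subscribers_received_heartbeat
    # Strict "more than half": exactly half failing is still treated as ok.
    if failed * 2 <= sample.total_subscribers:
        return "ok"
    if is_ubsan_build:
        # Known flakiness in UBSAN builds: skip instead of failing the test.
        return "skip"
    raise HeartbeatRpcError(
        "Thrift RPC failure: heartbeats failed for %d of %d subscribers"
        % (failed, sample.total_subscribers))


def heartbeat_is_stale(seconds_since_heartbeat, timeout_ms):
    """Compare in one unit: convert seconds to milliseconds first."""
    return seconds_since_heartbeat * 1000.0 > timeout_ms
```

In this sketch a test would call classify_heartbeat_coverage() once per
monitoring period and skip itself on "skip". The second helper shows the
kind of unit mismatch the bug fix addresses: comparing a value in seconds
directly against a millisecond threshold would almost never report a
stale heartbeat.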

Filed follow-up JIRA IMPALA-13399 to track the root cause.

Testing:
 - Ran test_statestored_ha.py in a loop 100 times in a UBSAN
   build with no failed cases, but 4 of the 100 iterations had
   skipped test cases.
 - Verified that the issue did not happen in an ASAN build by
   running test_statestored_ha.py 100 times in an ASAN build.
 - Passed core tests.

Change-Id: Ie59d1e93c635411723f7044da52e4ab19c7d2fac
Reviewed-on: http://gerrit.cloudera.org:8080/21820
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Thrift RPCs for Statestore heartbeat timed out randomly in UBSAN build 
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-13399
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13399
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Wenzhe Zhou
>            Priority: Major
>              Labels: flaky-test
>
> This issue was seen in IMPALA-13388.
> In UBSAN builds, Thrift RPCs for Statestore heartbeats randomly timed out in 
> Statestored HA unit tests, and heartbeats were not sent to 
> subscribers. See the log messages in the statestored log file:
> {code:java}
> I0917 00:06:54.902865 3728987 statestore.cc:1414] Unable to send heartbeat 
> message to subscriber 
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000,
>  received error: RPC recv timed out: dest address: 
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23020, rpc: 
> N6impala18THeartbeatResponseE
> Subscriber 
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000
>  timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:54.902873 3728987 failure-detector.cc:91] 4 consecutive 
> heartbeats failed for 
> 'catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000'.
>  State is OK
> I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest 
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, 
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:55.382797 3728993 client-cache.cc:174] Broken Connection, destroy 
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001
> I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat 
> message to subscriber 
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001, 
> received error: RPC recv timed out: dest address: 
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, rpc: 
> N6impala18THeartbeatResponseE
> Subscriber 
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001 
> timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:55.382876 3728993 failure-detector.cc:91] 4 consecutive 
> heartbeats failed for 
> 'impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001'. 
> State is OK
> I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest 
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000, 
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032806 3728990 client-cache.h:372] RPC recv timed out: dest 
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23002, 
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032831 3728989 client-cache.cc:174] Broken Connection, destroy 
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000
> {code}
> Subscriber (impalad or catalogd) log files showed that subscribers did not 
> receive heartbeats after successful registration, so they tried 
> to re-register with the statestored. The Thrift RPC issue is one-directional, 
> from statestored to subscribers. It seems something is wrong in the Thrift 
> RPC layer. Increasing the value of the flag 
> "statestore_heartbeat_tcp_timeout_seconds" from 3 seconds to 6 seconds did 
> not help; the issue still happened.
> This issue did not happen in other build types, such as regular and ASAN 
> builds. Statestored HA unit tests are not run for TSAN builds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
