[
https://issues.apache.org/jira/browse/IMPALA-13399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884252#comment-17884252
]
Wenzhe Zhou commented on IMPALA-13399:
--------------------------------------
After increasing value of flag
variable"statestore_heartbeat_tcp_timeout_seconds" from default value 3 to 6
(or 10), the issue still happens. But the log messages were changed as
{code:java}
I0924 03:03:44.437633 19167 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27002
I0924 03:03:44.437691 19167 statestore.cc:568] Subscriber
'impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27002'
with type COORDINATOR AND EXECUTOR registered (registration id:
a74408ad8ded5679:a3ed4a7589fba582)
I0924 03:03:44.459619 19321 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:44.459733 19321 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27000
I0924 03:03:44.459785 19321 statestore.cc:568] Subscriber
'impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27000'
with type COORDINATOR AND EXECUTOR registered (registration id:
13439739dab30251:79f422c4f2a9bfbf)
I0924 03:03:44.606819 19492 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:44.606947 19492 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27001
I0924 03:03:44.606992 19492 statestore.cc:568] Subscriber
'impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27001'
with type COORDINATOR AND EXECUTOR registered (registration id:
044309a5f1ada273:2f9f17ae652e1094)
I0924 03:03:49.261722 19132 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:49.261860 19132 statestore.cc:984] Registering:
catalog-ser...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:26000
I0924 03:03:49.261905 19132 statestore.cc:1007] Active catalogd role is
designated to
catalog-ser...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:26000
I0924 03:03:49.261917 19132 statestore.cc:568] Subscriber
'catalog-ser...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:26000'
with type CATALOGD registered (registration id:
ac4d6b270d142d5e:a3e50ee32808d087)
I0924 03:03:49.381103 18204 statestore.cc:1479] 0/4 subscribers successfully
heartbeat in the last 6000ms.
I0924 03:03:49.438024 19167 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:49.438146 19167 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27002
I0924 03:03:49.438190 19167 statestore.cc:568] Subscriber
'impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27002'
with type COORDINATOR AND EXECUTOR registered (registration id:
d84c49f5d50e48cc:ab412dda373b158d)
I0924 03:03:49.449852 18200 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23002,
rpc: N6impala18THeartbeatResponseE
I0924 03:03:49.449868 18200 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23002
I0924 03:03:49.460111 19321 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:49.460225 19321 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27000
I0924 03:03:49.460265 19321 statestore.cc:568] Subscriber
'impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27000'
with type COORDINATOR AND EXECUTOR registered (registration id:
82467d0c3b01c930:190fac7145880ebf)
I0924 03:03:49.463809 18201 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23000,
rpc: N6impala18THeartbeatResponseE
I0924 03:03:49.463821 18201 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23000
I0924 03:03:49.607353 19492 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:49.607489 19492 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27001
I0924 03:03:49.607539 19492 statestore.cc:568] Subscriber
'impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27001'
with type COORDINATOR AND EXECUTOR registered (registration id:
0849d9441460588a:fcee006f7ac187af)
I0924 03:03:49.649829 18193 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23001,
rpc: N6impala18THeartbeatResponseE
I0924 03:03:49.649849 18193 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23001
I0924 03:03:54.262281 19132 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:54.262408 19132 statestore.cc:984] Registering:
catalog-ser...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:26000
I0924 03:03:54.262451 19132 statestore.cc:1007] Active catalogd role is
designated to
catalog-ser...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:26000
I0924 03:03:54.262463 19132 statestore.cc:568] Subscriber
'catalog-ser...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:26000'
with type CATALOGD registered (registration id:
3c4fd8055a40fb80:0ad1656a6d11f4b9)
I0924 03:03:54.304822 18194 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23020,
rpc: N6impala18THeartbeatResponseE
I0924 03:03:54.304840 18194 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23020
I0924 03:03:54.437829 18199 client-cache.h:372] RPC recv timed out: dest
address: impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23002,
rpc: N6impala18THeartbeatResponseE
I0924 03:03:54.437848 18199 client-cache.cc:174] Broken Connection, destroy
client for impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:23002
I0924 03:03:54.438506 19167 statestore.cc:320] Subscriber protocol version: V2
I0924 03:03:54.438620 19167 statestore.cc:984] Registering:
impa...@impala-ec2-centos79-m6i-4xlarge-xldisk-1c7c.vpc.cloudera.com:27002
{code}
assert_log_contains() in the patch of IMPALA-13388 fails
> Thrift RPCs for Statestore heartbeat timed out randomly in UBSAN build
> -----------------------------------------------------------------------
>
> Key: IMPALA-13399
> URL: https://issues.apache.org/jira/browse/IMPALA-13399
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Wenzhe Zhou
> Priority: Major
> Labels: flaky-test
>
> This issue was saw in IMPALA-13388.
> In UBSAN build, Thrift RPCs for Statestore heartbeat were timed out in
> Statestored HA unit-tests randomly, and heartbeats were not sent to
> subscribers. See log messages in statestored log file:
> {code:java}
> I0917 00:06:54.902865 3728987 statestore.cc:1414] Unable to send heartbeat
> message to subscriber
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000,
> received error: RPC recv timed out: dest address:
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23020, rpc:
> N6impala18THeartbeatResponseE
> Subscriber
> catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000
> timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:54.902873 3728987 failure-detector.cc:91] 4 consecutive
> heartbeats failed for
> 'catalog-ser...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:26000'.
> State is OK
> I0917 00:06:55.382777 3728993 client-cache.h:372] RPC recv timed out: dest
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001,
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:55.382797 3728993 client-cache.cc:174] Broken Connection, destroy
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001
> I0917 00:06:55.382866 3728993 statestore.cc:1414] Unable to send heartbeat
> message to subscriber
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001,
> received error: RPC recv timed out: dest address:
> impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23001, rpc:
> N6impala18THeartbeatResponseE
> Subscriber
> impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001
> timed-out during heartbeat RPC. Timeout is 3s.
> I0917 00:06:55.382876 3728993 failure-detector.cc:91] 4 consecutive
> heartbeats failed for
> 'impa...@impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:27001'.
> State is OK
> I0917 00:06:56.032806 3728989 client-cache.h:372] RPC recv timed out: dest
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000,
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032806 3728990 client-cache.h:372] RPC recv timed out: dest
> address: impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23002,
> rpc: N6impala18THeartbeatResponseE
> I0917 00:06:56.032831 3728989 client-cache.cc:174] Broken Connection, destroy
> client for impala-ec2-rhel88-m7g-4xlarge-ondemand-1aed.vpc.cloudera.com:23000
> {code}
> Subscriber (impalad or catalogd) log files showed that subscribers did not
> receive heartbeats after successful registrations so that subscribers tried
> to re-register in the statestored. It's one directional Thrift RPC issue from
> statestored to subscribers. It seems something wrong in Thrift RPC layer.
> Tried to increase value of flag variable
> "statestore_heartbeat_tcp_timeout_seconds" from 3 seconds to 6 seconds, the
> issue still happened.
> This issue did not happen in other type of build, like regular build and ASAN
> build. Statestored HA unit-tests are not ran for TSAN build.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]