[GitHub] [apisix] wofr opened a new issue, #8036: help request: ETCD cluster loses member on GKE every day at 5am

GitBox Wed, 05 Oct 2022 13:08:48 -0700


wofr opened a new issue, #8036:
URL: https://github.com/apache/apisix/issues/8036


   ### Description
   
   I run a GKE cluster with 3 nodes. Besides several applications, I have also 
deployed the APISIX gateway on the cluster (chart: apisix, repoURL: 
https://charts.apiseven.com/, targetRevision: "0.11.0"), which deploys an 
etcd cluster (version 3.4.14) with 3 nodes.
   
   Here is the odd part: the etcd cluster comes up fine and everything works 
until 5:00 am each day. At that time the 3rd member leaves the cluster, 
while the second node is unaffected (see the logs below).
   
   Logs (etcd-0 node)
   
   ```
   2022-09-29 04:59:04.652 CEST etcd {"caller":"etcdserver/zap_raft.go:77", 
"level":"info", "logger":"raft", "msg":"90126cc714381e07 switched to 
configuration voters=(3177002992052145560 10381479693335928327)", 
"ts":"2022-09-29T02:59:04.652Z"}
   2022-09-29 04:59:04.653 CEST etcd {"caller":"membership/cluster.go:472", 
"cluster-id":"b0d7015fda1525c8", "level":"info", 
"local-member-id":"90126cc714381e07", "msg":"removed member", 
"removed-remote-peer-id":"3ff1b5cd453a87df", "removed-remote-peer-urls":[…], 
"ts":"2022-09-29T02:59:04.653Z"}
   2022-09-29 04:59:04.653 CEST etcd {"caller":"rafthttp/peer.go:330", 
"level":"info", "msg":"stopping remote peer", 
"remote-peer-id":"3ff1b5cd453a87df", "ts":"2022-09-29T02:59:04.653Z"}
   ```
   
   Logs (etcd-2 node)
   ```
   04:59:04.655 CEST{caller: rafthttp/stream.go:421, error: EOF, level: warn, 
local-member-id: 3ff1b5cd453a87df, msg: lost TCP streaming connection with 
remote peer, remote-peer-id: 90126cc714381e07, stream-reader-type: stream 
MsgApp v2, ts: 2022-09-29T02:59:04.654Z}
   04:59:04.678 CEST{caller: rafthttp/stream.go:421, error: EOF, level: warn, 
local-member-id: 3ff1b5cd453a87df, msg: lost TCP streaming connection with 
remote peer, remote-peer-id: 90126cc714381e07, stream-reader-type: stream 
Message, ts: 2022-09-29T02:59:04.656Z}
   04:59:04.678 CEST{caller: etcdserver/zap_raft.go:77, level: info, logger: 
raft, msg: 3ff1b5cd453a87df switched to configuration 
voters=(3177002992052145560 10381479693335928327), ts: 2022-09-29T02:59:04.653Z}
   04:59:04.678 CEST{caller: membership/cluster.go:472, cluster-id: 
b0d7015fda1525c8, level: info, local-member-id: 3ff1b5cd453a87df, msg: removed 
member, removed-remote-peer-id: 3ff1b5cd453a87df, removed-remote-peer-urls: 
[…], ts: 2022-09-29T02:59:04.657Z}
   04:59:04.678 CEST{caller: rafthttp/peer_status.go:66, error: failed to dial 
90126cc714381e07 on stream MsgApp v2 (the member has been permanently removed 
from the cluster), level: warn, msg: peer became inactive (message send to peer 
failed), peer-id: 90126cc714381e07, ts: 2022-09-29T02:59:04.659Z}
   04:59:04.678 CEST{caller: etcdserver/server.go:1150, error: the member has 
been permanently removed from the cluster, level: warn, msg: server error, ts: 
2022-09-29T02:59:04.659Z}
   04:59:04.678 CEST{caller: etcdserver/server.go:1151, level: warn, msg: 
data-dir used by this member must be removed, ts: 2022-09-29T02:59:04.659Z}
   04:59:04.678 CEST{caller: rafthttp/peer.go:330, level: info, msg: stopping 
remote peer, remote-peer-id: 2c16fb63879f0d98, ts: 2022-09-29T02:59:04.660Z}
   ```
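
The raft lines above print the voter IDs in decimal, while the other lines use hexadecimal member IDs, so the two are hard to match by eye. A minimal sketch for converting them, assuming the member ID is simply the hexadecimal form of the same 64-bit value (the helper name `member_id_hex` is only for illustration):

```python
def member_id_hex(voter_id: int) -> str:
    """Format a decimal raft voter ID as a lowercase hex member ID."""
    return format(voter_id, "x")


# Voter IDs copied from the "switched to configuration" log lines.
voters_after = [3177002992052145560, 10381479693335928327]
remaining = {member_id_hex(v) for v in voters_after}

# Member ID copied from the "removed member" log lines.
removed = "3ff1b5cd453a87df"

print(sorted(remaining))
print(removed in remaining)
```

If the assumption holds, this shows which two members remain voters after the configuration change and whether the removed ID is still among them.
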
   I have now observed this behaviour several times and have no idea what 
causes it. To me it looks like a GKE problem rather than an etcd one; 
nevertheless, maybe some of you have an idea what could cause it.
   
   Fun fact: it is always the 3rd node of the etcd cluster that gets removed 
at 5 am.
   
   ### Environment
   
   - APISIX version (run `apisix version`): 
/usr/local/openresty/luajit/bin/luajit ./apisix/cli/apisix.lua version 2.15.0
   - Operating system (run `uname -a`): Linux apisix-6dffdc8545-jn8sp 5.10.127+ 
#1 SMP Sat Jul 16 08:53:19 UTC 2022 x86_64 Linux
   - OpenResty / Nginx version (run `openresty -V` or `nginx -V`): nginx 
version: openresty/1.21.4.1 built by gcc 10.3.1 20210424 (Alpine 
10.3.1_git20210424) built with OpenSSL 1.1.1g  21 Apr 2020
   - etcd version, if relevant (run `curl 
http://127.0.0.1:9090/v1/server_info`):  3.4.14
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
