markap14 commented on PR #6779:
URL: https://github.com/apache/nifi/pull/6779#issuecomment-1372856078
Thanks for the latest updates, @exceptionfactory. Ran into another issue
when testing, unfortunately.
I have a statefulset that had 3 replicas. `nifi-1` was both the primary node
and the coordinator.
I then scaled the statefulset to 0.
This didn't expire the leases, though:
```
mpayne@cs-654103601966-default:~$ k get leases
NAME                  HOLDER             AGE
cluster-coordinator   nifi-1.nifi:4423   63m
primary-node          nifi-1.nifi:4423   62m
```
Even after I waited over an hour, the lease remains there. If I look at it:
```
mpayne@cs-654103601966-default:~$ k get lease cluster-coordinator -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-01-05T21:06:29Z"
  name: cluster-coordinator
  namespace: nifi
  resourceVersion: "252479"
  uid: 7e5d05d1-3b20-426d-8822-5cff92eb183f
spec:
  acquireTime: "2023-01-05T22:03:17.355642Z"
  holderIdentity: nifi-1.nifi:4423
  leaseDurationSeconds: 15
  leaseTransitions: 2
  renewTime: "2023-01-05T22:04:13.480562Z"
mpayne@cs-654103601966-default:~$ date
Thu 05 Jan 2023 10:11:34 PM UTC
```
We can see here that the current date is well past the renewTime (10:11:34 PM =
22:11:34 vs. 22:04:13 as the renew time), and far beyond the 15-second
leaseDurationSeconds.
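To make the gap concrete, here is a minimal sketch that uses only `java.time` and the timestamps copied from the output above. The class and method names are made up for illustration and are not part of the PR.

```java
import java.time.Duration;
import java.time.Instant;

public class LeaseAgeCheck {

    // Hypothetical helper: whole seconds elapsed between the last renewal and "now".
    static long secondsSinceRenew(Instant renewTime, Instant now) {
        return Duration.between(renewTime, now).getSeconds();
    }

    public static void main(String[] args) {
        // Values taken from the `k get lease` and `date` output above.
        Instant renewTime = Instant.parse("2023-01-05T22:04:13.480562Z");
        Instant now = Instant.parse("2023-01-05T22:11:34Z");
        long leaseDurationSeconds = 15;

        long elapsed = secondsSinceRenew(renewTime, now);
        System.out.println("seconds since renewTime: " + elapsed);              // prints 440
        System.out.println("past lease duration: " + (elapsed > leaseDurationSeconds)); // prints true
    }
}
```

So the lease was last renewed roughly 440 seconds before the `date` call, against a 15-second lease duration.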
So the lease appears to remain, and the new node, `nifi-0`, cannot proceed:
```
2023-01-05 22:09:37,513 INFO [main] o.a.n.c.p.AbstractNodeProtocolSender
Cluster Coordinator is located at nifi-1.nifi:4423. Will send Cluster
Connection Request to this address
2023-01-05 22:09:37,535 WARN [main] o.a.nifi.controller.StandardFlowService
Failed to connect to cluster due to:
org.apache.nifi.cluster.protocol.ProtocolException: Failed to create socket to
nifi-1.nifi:4423 due to: java.net.UnknownHostException: nifi-1.nifi
2023-01-05 22:09:42,550 INFO [main]
o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster
Coordinator is located at nifi-1.nifi:4423; will use this address for sending
heartbeat messages
2023-01-05 22:09:42,550 INFO [main] o.a.n.c.p.AbstractNodeProtocolSender
Cluster Coordinator is located at nifi-1.nifi:4423. Will send Cluster
Connection Request to this address
2023-01-05 22:09:42,550 WARN [main] o.a.nifi.controller.StandardFlowService
Failed to connect to cluster due to:
org.apache.nifi.cluster.protocol.ProtocolException: Failed to create socket to
nifi-1.nifi:4423 due to: java.net.UnknownHostException: nifi-1.nifi
```
As soon as I delete the lease (`k delete lease cluster-coordinator`) all
works as expected.
But we obviously can't have users manually deleting the lease all the time.
Not sure if this is the intended behavior. Should we be ignoring the
lease if the renewTime has expired? Or is it because we don't actually
participate in the leader election on startup, since there appears to already be
an elected leader?
Either way, we need to make sure that we can properly handle this condition,
where the lease points to a node that is no longer part of the cluster.
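One possible way to handle this, purely as a sketch: treat the holder as unknown once `renewTime + leaseDurationSeconds` has passed, so that a starting node does not try to contact a stale address. The class and method below are hypothetical and do not reflect how the PR or the Kubernetes client actually resolves the leader.

```java
import java.time.Instant;
import java.util.Optional;

public class StaleLeaseFilter {

    // Hypothetical helper: returns the lease holder only while the lease is still
    // within its duration; an expired or unheld lease yields an empty result.
    static Optional<String> effectiveLeader(String holderIdentity,
                                            Instant renewTime,
                                            long leaseDurationSeconds,
                                            Instant now) {
        if (holderIdentity == null || holderIdentity.isEmpty() || renewTime == null) {
            return Optional.empty();
        }
        Instant expiration = renewTime.plusSeconds(leaseDurationSeconds);
        return now.isAfter(expiration) ? Optional.empty() : Optional.of(holderIdentity);
    }

    public static void main(String[] args) {
        Instant renewTime = Instant.parse("2023-01-05T22:04:13.480562Z");
        // Time of the first failed connection attempt in the nifi-0 log above.
        Instant startup = Instant.parse("2023-01-05T22:09:37Z");

        Optional<String> leader = effectiveLeader("nifi-1.nifi:4423", renewTime, 15, startup);
        System.out.println(leader.isPresent() ? leader.get() : "no current leader");
    }
}
```

With the values from this report, the result is empty, which would let `nifi-0` fall through to the election instead of repeatedly trying `nifi-1.nifi:4423`. Whether ignoring an expired lease is the right behavior is exactly the open question.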