Anshul Mehta created ATLAS-4543:
-----------------------------------

             Summary: Atlas unable to join zookeeper election
                 Key: ATLAS-4543
                 URL: https://issues.apache.org/jira/browse/ATLAS-4543
             Project: Atlas
          Issue Type: Bug
            Reporter: Anshul Mehta
         Attachments: Screenshot 2022-01-28 at 8.07.06 PM.png

We are running Atlas on Kubernetes in HA mode. This is what our Kubernetes 
namespace looks like:

!Screenshot 2022-01-28 at 8.07.06 PM.png!

*Details of the issue -*

When only one ZooKeeper pod goes down and the other two stay up, the ZooKeeper 
cluster remains healthy, as expected. We verified this by exec'ing into the 
ZooKeeper pods and checking for a leader: one of the two remaining pods is 
always the leader.
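
The leader check described above can also be scripted. The following is a 
minimal sketch, not what we actually ran: it sends ZooKeeper's four-letter 
command "srvr" to each server and reads the "Mode:" line of the reply. The 
class name ZkModeCheck, the client port 2181 and the 3-second timeouts are 
assumptions, and on newer ZooKeeper releases "srvr" has to be allowed by 
4lw.commands.whitelist.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Atlas or ZooKeeper.
public class ZkModeCheck {

    /** Returns the value of the "Mode:" line in a srvr reply, or "unknown". */
    public static String parseMode(String srvrResponse) {
        for (String line : srvrResponse.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("Mode:")) {
                return trimmed.substring("Mode:".length()).trim();
            }
        }
        return "unknown";
    }

    /** Sends the four-letter command "srvr" and returns the raw reply. */
    public static String srvr(String host, int port) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 3000);
            socket.setSoTimeout(3000);
            socket.getOutputStream().write("srvr".getBytes(StandardCharsets.US_ASCII));
            socket.getOutputStream().flush();
            // The server closes the connection after replying, so read to EOF.
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return reply.toString(StandardCharsets.UTF_8.name());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the ZooKeeper pod hostnames as arguments; 2181 is assumed.
        for (String host : args) {
            System.out.println(host + " -> " + parseMode(srvr(host, 2181)));
        }
    }
}
```

A healthy two-of-three ensemble should report one "leader" and one 
"follower"; the stopped pod will simply fail to connect.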

If an Atlas pod gets deleted during this time, its corresponding znode is 
also deleted from ZooKeeper. *But when the pod is created again, its 
corresponding znode is not recreated in ZooKeeper.* This means the pod no 
longer takes part in the election, which can eventually lead to both Atlas 
pods being passive at the same time.

We ran one more test while one ZooKeeper pod was down and the other two were 
up: from inside one of the Atlas pods, we ran the ZooKeeper CLI against one of 
the running ZooKeeper pods and created a znode. The znode appeared on both 
running ZooKeeper pods.

So the issue appears to occur only when the Atlas code tries to create its 
znode (that is, join the election) via the CuratorFramework.


*How this issue can cause both atlas pods to go into the passive state -*
The following is not a contrived test case; it can happen in practice if you 
use AWS Spot Instances for your deployment.

*Phase 1*
atlas-0 -> active
atlas-1 -> passive
Only two of the three ZooKeeper pods are running

*Phase 2*
The atlas-0 pod gets deleted and comes back up after some time. Its znode is 
not recreated in ZooKeeper. At this point the atlas-1 pod becomes active, so 
everything still works.

atlas-0 -> passive
atlas-1 -> active

*Phase 3*
Now the atlas-1 pod gets deleted and comes back up. Its znode is not recreated 
in ZooKeeper either.
The znode for atlas-0 was deleted in Phase 2 and never recreated, and the 
znode for atlas-1 has now been deleted in this phase and not recreated. There 
are no election znodes left, so no Atlas pod can be elected leader.
Current state of the Atlas pods:

atlas-0 -> passive
atlas-1 -> passive

*Phase 4*
Restart both Atlas pods once all three ZooKeeper pods are up and running. 
Everything returns to normal, with one pod active and the other passive.
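
The four phases can be replayed with a toy model. This is only a sketch of 
the reasoning, not Atlas or Curator code: the class ElectionSimulation and 
the flag registrationBroken are hypothetical, and it assumes the usual 
election recipe in which the pod with the oldest election znode is the 
leader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the election znodes; hypothetical, for illustration only.
public class ElectionSimulation {
    // Election znodes in creation order; the oldest entry is treated as leader.
    private final List<String> znodes = new ArrayList<>();
    // Stand-in for the reported bug: when true, new znodes are not created.
    private boolean registrationBroken = false;

    public void setRegistrationBroken(boolean broken) {
        registrationBroken = broken;
    }

    // A starting pod tries to create its election znode.
    public void start(String pod) {
        if (!registrationBroken && !znodes.contains(pod)) {
            znodes.add(pod);
        }
    }

    // Deleting a pod removes its znode; the new pod then tries to register.
    public void restart(String pod) {
        znodes.remove(pod);
        start(pod);
    }

    public String stateOf(String pod) {
        boolean leader = !znodes.isEmpty() && znodes.get(0).equals(pod);
        return leader ? "active" : "passive";
    }

    public Map<String, String> states() {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pod : new String[] {"atlas-0", "atlas-1"}) {
            result.put(pod, stateOf(pod));
        }
        return result;
    }

    public static void main(String[] args) {
        ElectionSimulation sim = new ElectionSimulation();
        sim.start("atlas-0");
        sim.start("atlas-1");
        sim.setRegistrationBroken(true);   // one ZooKeeper pod goes down
        System.out.println("Phase 1: " + sim.states());
        sim.restart("atlas-0");
        System.out.println("Phase 2: " + sim.states());
        sim.restart("atlas-1");
        System.out.println("Phase 3: " + sim.states());
        sim.setRegistrationBroken(false);  // all three ZooKeeper pods are back
        sim.restart("atlas-0");
        sim.restart("atlas-1");
        System.out.println("Phase 4: " + sim.states());
    }
}
```

In this model Phase 3 ends with an empty znode list, which is why neither 
pod can become active until both are restarted in Phase 4.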



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
