Richard Pijnenburg created ATLAS-4659:
-----------------------------------------
Summary: Atlas in HA mode fails to get healthy
Key: ATLAS-4659
URL: https://issues.apache.org/jira/browse/ATLAS-4659
Project: Atlas
Issue Type: Bug
Affects Versions: 3.0.0
Environment: Zookeeper 3.8.0
Reporter: Richard Pijnenburg
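We are trying to set up Atlas in HA mode using ZooKeeper 3.8.0. The instance is elected as leader, but about 20 seconds later activation fails and it is removed as leader again, so it never becomes healthy.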
Relevant logs:
{code:java}
2022-08-18 14:57:06,924 INFO - [main:] ~ Found matched server id id1 with host
port: atlas-0.atlas-headless.atlas.svc.cluster.local:21000
(AtlasServerIdSelector:65)
2022-08-18 14:57:06,924 INFO - [main:] ~ Starting leader election for id1
(ActiveInstanceElectorService:112)
2022-08-18 14:57:06,933 INFO - [main:] ~ Leader latch started for id1.
(ActiveInstanceElectorService:118)
2022-08-18 14:57:06,991 INFO - [main:] ~ AtlasJsonProvider() instantiated
(AtlasJsonProvider:53)
2022-08-18 14:57:07,296 WARN - [main-EventThread:] ~ Server instance with
server id id1 is elected as leader (ActiveInstanceElectorService:152)
2022-08-18 14:57:07,296 WARN - [main-EventThread:] ~ Instance becoming active
from PASSIVE (ServiceState:88)
[...]
2022-08-18 14:57:27,818 INFO - [main-EventThread:] ~ Reacting to active state:
initializing Kafka consumers (NotificationHookConsumer:421)
2022-08-18 14:57:27,819 INFO - [main-EventThread:] ~ ==>
KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1,
autoCommitEnabled=false) (KafkaNotification:194)
2022-08-18 14:57:28,237 INFO - [main-EventThread:] ~ <==
KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1,
autoCommitEnabled=false) (KafkaNotification:234)
2022-08-18 14:57:28,402 INFO - [main-EventThread:] ~ ==>
TaskManagement.instanceIsActive() (TaskManagement:94)
2022-08-18 14:57:28,402 INFO - [main-EventThread:] ~ TaskManagement: Started!
(TaskManagement:196)
2022-08-18 14:57:28,479 INFO - [NotificationHookConsumer thread-0:] ~
[atlas-hook-consumer-thread]: Starting (Logging:66)
2022-08-18 14:57:28,481 INFO - [NotificationHookConsumer thread-0:] ~ ==>
HookConsumer doWork() (NotificationHookConsumer$HookConsumer:540)
2022-08-18 14:57:28,483 INFO - [NotificationHookConsumer thread-0:] ~ Atlas
Server is not ready. Waiting for 1000 milliseconds to retry...
(NotificationHookConsumer$HookConsumer:940)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ TaskManagement: Found: 0:
Tasks in pending state. (TaskManagement:212)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ <==
TaskManagement.instanceIsActive() (TaskManagement:98)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ ==>
IndexRecoveryService.instanceIsActive() (IndexRecoveryService:117)
2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ <==
IndexRecoveryService.instanceIsActive() (IndexRecoveryService:121)
2022-08-18 14:57:28,486 INFO - [index-health-monitor:] ~ Index Health Monitor:
Starting... (IndexRecoveryService$RecoveryThread:175)
2022-08-18 14:57:28,487 ERROR - [main-EventThread:] ~ Got exception while
activating (ActiveInstanceElectorService:162)
org.apache.atlas.exception.AtlasBaseException: ActiveInstanceState.update resulted in exception.
    at org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:119)
    at org.apache.atlas.web.service.ActiveInstanceElectorService.isLeader(ActiveInstanceElectorService.java:158)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
    at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
    at org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:697)
    at org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:575)
    at org.apache.curator.framework.recipes.leader.LeaderLatch.access$600(LeaderLatch.java:65)
    at org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:626)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653)
    at org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
    at org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:627)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
    at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:459)
    at org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:109)
    ... 16 more
2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Server instance with
server id id1 is removed as leader (ActiveInstanceElectorService:199)
2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Instance becoming passive
from BECOMING_ACTIVE (ServiceState:119)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ ==>
IndexRecoveryService.instanceIsPassive() (IndexRecoveryService:126)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ Index Health Monitor:
Shutdown: Starting... (IndexRecoveryService$RecoveryThread:196)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ Index Health Monitor:
Shutdown: Done! (IndexRecoveryService$RecoveryThread:206)
2022-08-18 14:57:29,484 INFO - [NotificationHookConsumer thread-0:] ~ Atlas
Server is not ready. Waiting for 1000 milliseconds to retry...
(NotificationHookConsumer$HookConsumer:940)
2022-08-18 14:57:30,484 INFO - [NotificationHookConsumer thread-0:] ~ Atlas
Server is not ready. Waiting for 1000 milliseconds to retry...
(NotificationHookConsumer$HookConsumer:940) {code}
Running Atlas in non-HA mode works fine.
The same ZooKeeper instance is also used by Cassandra and Solr, and those don't
seem to have any issues with it.
It's unclear from the logs where the actual issue is.
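To narrow the log down, here is a minimal triage sketch in Python. It is not part of Atlas: the `summarize` function, the `SAMPLE` excerpt and the regular expression are my own assumptions, based on the log layout shown above. It re-joins the wrapped lines, skips stack frames, and keeps only the WARN and ERROR records plus any "Caused by" lines.

```python
import re

# Start of a log record, e.g. "2022-08-18 14:57:28,487 ERROR - [thread:] ~ message".
# Assumption: the layout matches the excerpt in this report.
RECORD_START = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+)\s+- ")


def summarize(log_text):
    """Return (timestamp, level, message) tuples for WARN/ERROR records and 'Caused by' lines."""
    records = []
    current = None
    skip_next = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if skip_next:            # frame text that followed a bare, wrapped "at"
            skip_next = False
            continue
        if not line:
            continue
        if line == "at":         # wrapped stack frame: the frame is on the next line
            skip_next = True
            continue
        if line.startswith("at ") or line.startswith("..."):
            continue             # unwrapped stack frame or "... N more"
        match = RECORD_START.match(line)
        if match:
            current = [match.group(1), match.group(2), line[match.end():]]
            records.append(current)
        elif line.startswith("Caused by:"):
            current = ["", "CAUSE", line]
            records.append(current)
        elif current is not None:
            current[2] += " " + line   # re-join a wrapped continuation line
    return [tuple(r) for r in records if r[1] in ("WARN", "ERROR", "CAUSE")]


# Shortened excerpt copied from the log above.
SAMPLE = """\
2022-08-18 14:57:28,487 ERROR - [main-EventThread:] ~ Got exception while
activating (ActiveInstanceElectorService:162)
org.apache.atlas.exception.AtlasBaseException: ActiveInstanceState.update
resulted in exception.
at
org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:119)
Caused by: java.lang.IllegalStateException: Expected state [STARTED] was
[STOPPED]
at
org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
... 16 more
2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Instance becoming passive
from BECOMING_ACTIVE (ServiceState:119)
2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ ==>
IndexRecoveryService.instanceIsPassive() (IndexRecoveryService:126)
"""

if __name__ == "__main__":
    for timestamp, level, message in summarize(SAMPLE):
        print(timestamp, level, message)
```

On the full log this leaves the leader election WARN lines, the "Got exception while activating" error and the "Caused by" line. As far as I know, Curator throws that `IllegalStateException` when a `CuratorFramework` client is used after `close()` has been called on it, so the client used by `ActiveInstanceState.update` may already have been closed at that point. That is a guess from the trace, not something I have confirmed.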