[
https://issues.apache.org/jira/browse/ATLAS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Richard Pijnenburg updated ATLAS-4659:
--------------------------------------
Environment: Zookeeper 3.8.0, Cassandra 4.0.5, Solr 8.11.1 (was: Zookeeper
3.8.0)
> Atlas in HA mode fails to get healthy
> -------------------------------------
>
> Key: ATLAS-4659
> URL: https://issues.apache.org/jira/browse/ATLAS-4659
> Project: Atlas
> Issue Type: Bug
> Affects Versions: 3.0.0
> Environment: Zookeeper 3.8.0, Cassandra 4.0.5, Solr 8.11.1
> Reporter: Richard Pijnenburg
> Priority: Major
>
> We are trying to set up Atlas with the HA functionality using ZooKeeper 3.8.0.
> Relevant logs:
> {code:java}
> 2022-08-18 14:57:06,924 INFO - [main:] ~ Found matched server id id1 with
> host port: atlas-0.atlas-headless.atlas.svc.cluster.local:21000
> (AtlasServerIdSelector:65)
> 2022-08-18 14:57:06,924 INFO - [main:] ~ Starting leader election for id1
> (ActiveInstanceElectorService:112)
> 2022-08-18 14:57:06,933 INFO - [main:] ~ Leader latch started for id1.
> (ActiveInstanceElectorService:118)
> 2022-08-18 14:57:06,991 INFO - [main:] ~ AtlasJsonProvider() instantiated
> (AtlasJsonProvider:53)
> 2022-08-18 14:57:07,296 WARN - [main-EventThread:] ~ Server instance with
> server id id1 is elected as leader (ActiveInstanceElectorService:152)
> 2022-08-18 14:57:07,296 WARN - [main-EventThread:] ~ Instance becoming
> active from PASSIVE (ServiceState:88
>
> ———
>
> 2022-08-18 14:57:27,818 INFO - [main-EventThread:] ~ Reacting to active
> state: initializing Kafka consumers (NotificationHookConsumer:421)
> 2022-08-18 14:57:27,819 INFO - [main-EventThread:] ~ ==>
> KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1,
> autoCommitEnabled=false) (KafkaNotification:194)
> 2022-08-18 14:57:28,237 INFO - [main-EventThread:] ~ <==
> KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1,
> autoCommitEnabled=false) (KafkaNotification:234)
> 2022-08-18 14:57:28,402 INFO - [main-EventThread:] ~ ==>
> TaskManagement.instanceIsActive() (TaskManagement:94)
> 2022-08-18 14:57:28,402 INFO - [main-EventThread:] ~ TaskManagement:
> Started! (TaskManagement:196)
> 2022-08-18 14:57:28,479 INFO - [NotificationHookConsumer thread-0:] ~
> [atlas-hook-consumer-thread]: Starting (Logging:66)
> 2022-08-18 14:57:28,481 INFO - [NotificationHookConsumer thread-0:] ~ ==>
> HookConsumer doWork() (NotificationHookConsumer$HookConsumer:540)
> 2022-08-18 14:57:28,483 INFO - [NotificationHookConsumer thread-0:] ~ Atlas
> Server is not ready. Waiting for 1000 milliseconds to retry...
> (NotificationHookConsumer$HookConsumer:940)
> 2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ TaskManagement: Found:
> 0: Tasks in pending state. (TaskManagement:212)
> 2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ <==
> TaskManagement.instanceIsActive() (TaskManagement:98)
> 2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ ==>
> IndexRecoveryService.instanceIsActive() (IndexRecoveryService:117)
> 2022-08-18 14:57:28,485 INFO - [main-EventThread:] ~ <==
> IndexRecoveryService.instanceIsActive() (IndexRecoveryService:121)
> 2022-08-18 14:57:28,486 INFO - [index-health-monitor:] ~ Index Health
> Monitor: Starting... (IndexRecoveryService$RecoveryThread:175)
> 2022-08-18 14:57:28,487 ERROR - [main-EventThread:] ~ Got exception while
> activating (ActiveInstanceElectorService:162)
> org.apache.atlas.exception.AtlasBaseException: ActiveInstanceState.update
> resulted in exception.
> at
> org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:119)
> at
> org.apache.atlas.web.service.ActiveInstanceElectorService.isLeader(ActiveInstanceElectorService.java:158)
> at
> org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
> at
> org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
> at
> org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
> at
> org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
> at
> org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
> at
> org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:697)
> at
> org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:575)
> at
> org.apache.curator.framework.recipes.leader.LeaderLatch.access$600(LeaderLatch.java:65)
> at
> org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:626)
> at
> org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883)
> at
> org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653)
> at
> org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
> at
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:627)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: java.lang.IllegalStateException: Expected state [STARTED] was
> [STOPPED]
> at
> org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
> at
> org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
> at
> org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:459)
> at
> org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:109)
> ... 16 more
> 2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Server instance with
> server id id1 is removed as leader (ActiveInstanceElectorService:199)
> 2022-08-18 14:57:28,487 WARN - [main-EventThread:] ~ Instance becoming
> passive from BECOMING_ACTIVE (ServiceState:119)
> 2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ ==>
> IndexRecoveryService.instanceIsPassive() (IndexRecoveryService:126)
> 2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ Index Health Monitor:
> Shutdown: Starting... (IndexRecoveryService$RecoveryThread:196)
> 2022-08-18 14:57:28,487 INFO - [main-EventThread:] ~ Index Health Monitor:
> Shutdown: Done! (IndexRecoveryService$RecoveryThread:206)
> 2022-08-18 14:57:29,484 INFO - [NotificationHookConsumer thread-0:] ~ Atlas
> Server is not ready. Waiting for 1000 milliseconds to retry...
> (NotificationHookConsumer$HookConsumer:940)
> 2022-08-18 14:57:30,484 INFO - [NotificationHookConsumer thread-0:] ~ Atlas
> Server is not ready. Waiting for 1000 milliseconds to retry...
> (NotificationHookConsumer$HookConsumer:940) {code}
> Running Atlas in non-HA mode works fine.
> The same ZooKeeper instance is also used by Cassandra and Solr, and those don't
> seem to have any issues with it.
> It's unclear from the logs where the actual issue is.
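
One thing the stack trace does show: the IllegalStateException is raised by CuratorFrameworkImpl.checkState, reached from ActiveInstanceState.update via checkExists, so the Curator client was already in the STOPPED state when the activation callback ran. The sketch below only models that guard, using a hypothetical SketchClient class (an assumption for illustration, not the real Curator API), to show how reusing an already closed client would produce the same message. Why the client was stopped in this deployment is not established by the logs.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a Curator-style client; NOT the real Curator API.
final class SketchClient {
    enum State { LATENT, STARTED, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.LATENT);

    void start() { state.set(State.STARTED); }

    void close() { state.set(State.STOPPED); }

    // Models the guard visible in the trace: operations require STARTED.
    boolean checkExists(String path) {
        State current = state.get();
        if (current != State.STARTED) {
            throw new IllegalStateException(
                "Expected state [STARTED] was [" + current + "]");
        }
        return false; // sketch only: no real ZooKeeper lookup is performed
    }
}

final class ClosedClientSketch {
    // Returns the exception message produced when a closed client is reused.
    static String useAfterClose() {
        SketchClient client = new SketchClient();
        client.start();
        client.close(); // client is stopped before the "activation" step runs
        try {
            client.checkExists("/hypothetical/path"); // path is illustrative
            return "no exception";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(useAfterClose());
    }
}
```

If this pattern matches what happens here, a useful next step would be to find which code path closes the shared client before the leader-election callback fires.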