[jira] [Updated] (ATLAS-4659) Atlas in HA mode fails to get healthy

Richard Pijnenburg (Jira) Thu, 18 Aug 2022 08:12:06 -0700


     [ 
https://issues.apache.org/jira/browse/ATLAS-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Richard Pijnenburg updated ATLAS-4659:
--------------------------------------
    Environment: Zookeeper 3.8.0, Cassandra 4.0.5, Solr 8.11.1 (was: Zookeeper 3.8.0)

> Atlas in HA mode fails to get healthy
> -------------------------------------
>
>                 Key: ATLAS-4659
>                 URL: https://issues.apache.org/jira/browse/ATLAS-4659
>             Project: Atlas
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>         Environment: Zookeeper 3.8.0, Cassandra 4.0.5, Solr 8.11.1
>            Reporter: Richard Pijnenburg
>            Priority: Major
>
> We are trying to set up Atlas with the HA functionality, using ZooKeeper 3.8.0.
> Relevant logs:
> {code:java}
> 2022-08-18 14:57:06,924 INFO  - [main:] ~ Found matched server id id1 with 
> host port: atlas-0.atlas-headless.atlas.svc.cluster.local:21000 
> (AtlasServerIdSelector:65)
> 2022-08-18 14:57:06,924 INFO  - [main:] ~ Starting leader election for id1 
> (ActiveInstanceElectorService:112)
> 2022-08-18 14:57:06,933 INFO  - [main:] ~ Leader latch started for id1. 
> (ActiveInstanceElectorService:118)
> 2022-08-18 14:57:06,991 INFO  - [main:] ~ AtlasJsonProvider() instantiated 
> (AtlasJsonProvider:53)
> 2022-08-18 14:57:07,296 WARN  - [main-EventThread:] ~ Server instance with 
> server id id1 is elected as leader (ActiveInstanceElectorService:152)
> 2022-08-18 14:57:07,296 WARN  - [main-EventThread:] ~ Instance becoming 
> active from PASSIVE (ServiceState:88)
>  
> ———
>  
> 2022-08-18 14:57:27,818 INFO  - [main-EventThread:] ~ Reacting to active 
> state: initializing Kafka consumers (NotificationHookConsumer:421)
> 2022-08-18 14:57:27,819 INFO  - [main-EventThread:] ~ ==> 
> KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1, 
> autoCommitEnabled=false) (KafkaNotification:194)
> 2022-08-18 14:57:28,237 INFO  - [main-EventThread:] ~ <== 
> KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1, 
> autoCommitEnabled=false) (KafkaNotification:234)
> 2022-08-18 14:57:28,402 INFO  - [main-EventThread:] ~ ==> 
> TaskManagement.instanceIsActive() (TaskManagement:94)
> 2022-08-18 14:57:28,402 INFO  - [main-EventThread:] ~ TaskManagement: 
> Started! (TaskManagement:196)
> 2022-08-18 14:57:28,479 INFO  - [NotificationHookConsumer thread-0:] ~ 
> [atlas-hook-consumer-thread]: Starting (Logging:66)
> 2022-08-18 14:57:28,481 INFO  - [NotificationHookConsumer thread-0:] ~ ==> 
> HookConsumer doWork() (NotificationHookConsumer$HookConsumer:540)
> 2022-08-18 14:57:28,483 INFO  - [NotificationHookConsumer thread-0:] ~ Atlas 
> Server is not ready. Waiting for 1000 milliseconds to retry... 
> (NotificationHookConsumer$HookConsumer:940)
> 2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ TaskManagement: Found: 
> 0: Tasks in pending state. (TaskManagement:212)
> 2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ <== 
> TaskManagement.instanceIsActive() (TaskManagement:98)
> 2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ ==> 
> IndexRecoveryService.instanceIsActive() (IndexRecoveryService:117)
> 2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ <== 
> IndexRecoveryService.instanceIsActive() (IndexRecoveryService:121)
> 2022-08-18 14:57:28,486 INFO  - [index-health-monitor:] ~ Index Health 
> Monitor: Starting... (IndexRecoveryService$RecoveryThread:175)
> 2022-08-18 14:57:28,487 ERROR - [main-EventThread:] ~ Got exception while 
> activating (ActiveInstanceElectorService:162)
> org.apache.atlas.exception.AtlasBaseException: ActiveInstanceState.update 
> resulted in exception.
>         at 
> org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:119)
>         at 
> org.apache.atlas.web.service.ActiveInstanceElectorService.isLeader(ActiveInstanceElectorService.java:158)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
>         at 
> org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
>         at 
> org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
>         at 
> org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:697)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:575)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderLatch.access$600(LeaderLatch.java:65)
>         at 
> org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:626)
>         at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883)
>         at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653)
>         at 
> org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
>         at 
> org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:627)
>         at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Caused by: java.lang.IllegalStateException: Expected state [STARTED] was 
> [STOPPED]
>         at 
> org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
>         at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
>         at 
> org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:459)
>         at 
> org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:109)
>         ... 16 more
> 2022-08-18 14:57:28,487 WARN  - [main-EventThread:] ~ Server instance with 
> server id id1 is removed as leader (ActiveInstanceElectorService:199)
> 2022-08-18 14:57:28,487 WARN  - [main-EventThread:] ~ Instance becoming 
> passive from BECOMING_ACTIVE (ServiceState:119)
> 2022-08-18 14:57:28,487 INFO  - [main-EventThread:] ~ ==> 
> IndexRecoveryService.instanceIsPassive() (IndexRecoveryService:126)
> 2022-08-18 14:57:28,487 INFO  - [main-EventThread:] ~ Index Health Monitor: 
> Shutdown: Starting... (IndexRecoveryService$RecoveryThread:196)
> 2022-08-18 14:57:28,487 INFO  - [main-EventThread:] ~ Index Health Monitor: 
> Shutdown: Done! (IndexRecoveryService$RecoveryThread:206)
> 2022-08-18 14:57:29,484 INFO  - [NotificationHookConsumer thread-0:] ~ Atlas 
> Server is not ready. Waiting for 1000 milliseconds to retry... 
> (NotificationHookConsumer$HookConsumer:940)
> 2022-08-18 14:57:30,484 INFO  - [NotificationHookConsumer thread-0:] ~ Atlas 
> Server is not ready. Waiting for 1000 milliseconds to retry... 
> (NotificationHookConsumer$HookConsumer:940) {code}
> Running Atlas in non-HA mode works fine.
> The same ZooKeeper instance is also used for Cassandra and Solr, and those don't 
> seem to have any issues with it.
> It's unclear from the logs where the actual issue is.
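
A note on reading the trace above: the `Caused by` line comes from a state guard. `CuratorFrameworkImpl.checkExists` calls `checkState`, which throws unless the client is in the STARTED state, and STOPPED is the state a Curator client is in after `close()`. So the trace suggests that the Curator client used by `ActiveInstanceState.update` had already been closed by the time the leader callback ran. Why it was closed is not visible in these logs. The sketch below is a simplified stand-in, not Curator's or Atlas's actual code: `SketchClient`, `ClientState` and the znode path are hypothetical names used only to reproduce the same message.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of a client lifecycle; NOT Curator's real implementation.
enum ClientState { LATENT, STARTED, STOPPED }

class SketchClient {
    private final AtomicReference<ClientState> state =
            new AtomicReference<>(ClientState.LATENT);

    void start() { state.set(ClientState.STARTED); }

    void close() { state.set(ClientState.STOPPED); }

    // Every operation first checks the lifecycle state, as the trace shows
    // checkExists -> checkState doing.
    boolean checkExists(String path) {
        ClientState current = state.get();
        if (current != ClientState.STARTED) {
            throw new IllegalStateException(
                    "Expected state [" + ClientState.STARTED + "] was [" + current + "]");
        }
        return false; // assumption: the sketch holds no real znodes
    }
}

class Main {
    public static void main(String[] args) {
        SketchClient client = new SketchClient();
        client.start();
        client.close(); // the client is shut down first ...
        try {
            // ... and a later callback still tries to use it (hypothetical path)
            client.checkExists("/hypothetical/active-instance-path");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Expected state [STARTED] was [STOPPED]
        }
    }
}
```

If this reading is right, the thing to look for is whatever closes the client between the "Leader latch started" line and the "Got exception while activating" line, rather than a ZooKeeper server-side problem.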



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
