[jira] [Created] (ATLAS-4659) Atlas in HA mode fails to get healthy

Richard Pijnenburg (Jira) Thu, 18 Aug 2022 08:07:10 -0700

Richard Pijnenburg created ATLAS-4659:
-----------------------------------------


             Summary: Atlas in HA mode fails to get healthy
                 Key: ATLAS-4659
                 URL: https://issues.apache.org/jira/browse/ATLAS-4659
             Project: Atlas
          Issue Type: Bug
    Affects Versions: 3.0.0
         Environment: Zookeeper 3.8.0
            Reporter: Richard Pijnenburg


We are trying to set up Atlas with the HA functionality, using ZooKeeper 3.8.0.

Relevant logs:
{code:java}
2022-08-18 14:57:06,924 INFO  - [main:] ~ Found matched server id id1 with host 
port: atlas-0.atlas-headless.atlas.svc.cluster.local:21000 
(AtlasServerIdSelector:65)
2022-08-18 14:57:06,924 INFO  - [main:] ~ Starting leader election for id1 
(ActiveInstanceElectorService:112)
2022-08-18 14:57:06,933 INFO  - [main:] ~ Leader latch started for id1. 
(ActiveInstanceElectorService:118)
2022-08-18 14:57:06,991 INFO  - [main:] ~ AtlasJsonProvider() instantiated 
(AtlasJsonProvider:53)
2022-08-18 14:57:07,296 WARN  - [main-EventThread:] ~ Server instance with 
server id id1 is elected as leader (ActiveInstanceElectorService:152)
2022-08-18 14:57:07,296 WARN  - [main-EventThread:] ~ Instance becoming active 
from PASSIVE (ServiceState:88)
 
———
 
2022-08-18 14:57:27,818 INFO  - [main-EventThread:] ~ Reacting to active state: 
initializing Kafka consumers (NotificationHookConsumer:421)
2022-08-18 14:57:27,819 INFO  - [main-EventThread:] ~ ==> 
KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1, 
autoCommitEnabled=false) (KafkaNotification:194)
2022-08-18 14:57:28,237 INFO  - [main-EventThread:] ~ <== 
KafkaNotification.createConsumers(notificationType=HOOK, numConsumers=1, 
autoCommitEnabled=false) (KafkaNotification:234)
2022-08-18 14:57:28,402 INFO  - [main-EventThread:] ~ ==> 
TaskManagement.instanceIsActive() (TaskManagement:94)
2022-08-18 14:57:28,402 INFO  - [main-EventThread:] ~ TaskManagement: Started! 
(TaskManagement:196)
2022-08-18 14:57:28,479 INFO  - [NotificationHookConsumer thread-0:] ~ 
[atlas-hook-consumer-thread]: Starting (Logging:66)
2022-08-18 14:57:28,481 INFO  - [NotificationHookConsumer thread-0:] ~ ==> 
HookConsumer doWork() (NotificationHookConsumer$HookConsumer:540)
2022-08-18 14:57:28,483 INFO  - [NotificationHookConsumer thread-0:] ~ Atlas 
Server is not ready. Waiting for 1000 milliseconds to retry... 
(NotificationHookConsumer$HookConsumer:940)
2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ TaskManagement: Found: 0: 
Tasks in pending state. (TaskManagement:212)
2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ <== 
TaskManagement.instanceIsActive() (TaskManagement:98)
2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ ==> 
IndexRecoveryService.instanceIsActive() (IndexRecoveryService:117)
2022-08-18 14:57:28,485 INFO  - [main-EventThread:] ~ <== 
IndexRecoveryService.instanceIsActive() (IndexRecoveryService:121)
2022-08-18 14:57:28,486 INFO  - [index-health-monitor:] ~ Index Health Monitor: 
Starting... (IndexRecoveryService$RecoveryThread:175)
2022-08-18 14:57:28,487 ERROR - [main-EventThread:] ~ Got exception while 
activating (ActiveInstanceElectorService:162)
org.apache.atlas.exception.AtlasBaseException: ActiveInstanceState.update 
resulted in exception.
        at 
org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:119)
        at 
org.apache.atlas.web.service.ActiveInstanceElectorService.isLeader(ActiveInstanceElectorService.java:158)
        at 
org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:702)
        at 
org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:698)
        at 
org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
        at 
org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at 
org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
        at 
org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:697)
        at 
org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:575)
        at 
org.apache.curator.framework.recipes.leader.LeaderLatch.access$600(LeaderLatch.java:65)
        at 
org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:626)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653)
        at 
org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
        at 
org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:627)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: java.lang.IllegalStateException: Expected state [STARTED] was 
[STOPPED]
        at 
org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:823)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:432)
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.checkExists(CuratorFrameworkImpl.java:459)
        at 
org.apache.atlas.web.service.ActiveInstanceState.update(ActiveInstanceState.java:109)
        ... 16 more
2022-08-18 14:57:28,487 WARN  - [main-EventThread:] ~ Server instance with 
server id id1 is removed as leader (ActiveInstanceElectorService:199)
2022-08-18 14:57:28,487 WARN  - [main-EventThread:] ~ Instance becoming passive 
from BECOMING_ACTIVE (ServiceState:119)
2022-08-18 14:57:28,487 INFO  - [main-EventThread:] ~ ==> 
IndexRecoveryService.instanceIsPassive() (IndexRecoveryService:126)
2022-08-18 14:57:28,487 INFO  - [main-EventThread:] ~ Index Health Monitor: 
Shutdown: Starting... (IndexRecoveryService$RecoveryThread:196)
2022-08-18 14:57:28,487 INFO  - [main-EventThread:] ~ Index Health Monitor: 
Shutdown: Done! (IndexRecoveryService$RecoveryThread:206)
2022-08-18 14:57:29,484 INFO  - [NotificationHookConsumer thread-0:] ~ Atlas 
Server is not ready. Waiting for 1000 milliseconds to retry... 
(NotificationHookConsumer$HookConsumer:940)
2022-08-18 14:57:30,484 INFO  - [NotificationHookConsumer thread-0:] ~ Atlas 
Server is not ready. Waiting for 1000 milliseconds to retry... 
(NotificationHookConsumer$HookConsumer:940) {code}
Running Atlas in non-HA mode works fine.

The same ZooKeeper instance is also used for Cassandra and Solr, and those don't 
seem to have any issues with it.

It's unclear from the logs where the actual issue is.
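
One thing the trace does seem to show: the "Caused by" line comes from a state check inside the Curator client, hit when ActiveInstanceState.update called checkExists, which suggests the client had already been closed by then. The sketch below is only an illustration of that kind of guard. It is not Curator or Atlas code: the class, the SketchClient type and the path are hypothetical stand-ins, and it assumes nothing about why the client was closed in our setup.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ClientStateSketch {
    // Hypothetical stand-in for a client lifecycle; not Curator's actual code.
    enum State { LATENT, STARTED, STOPPED }

    static final class SketchClient {
        private final AtomicReference<State> state =
                new AtomicReference<>(State.LATENT);

        void start() { state.set(State.STARTED); }

        void close() { state.set(State.STOPPED); }

        // Every operation first verifies that the client is still started,
        // similar in spirit to the check visible in the stack trace.
        boolean checkExists(String path) {
            State current = state.get();
            if (current != State.STARTED) {
                throw new IllegalStateException(
                        "Expected state [" + State.STARTED + "] was [" + current + "]");
            }
            return false; // the sketch has no real ZooKeeper behind it
        }
    }

    // Returns the exception message produced when the client is used after close().
    static String useAfterClose() {
        SketchClient client = new SketchClient();
        client.start();
        client.close();
        try {
            client.checkExists("/hypothetical/path");
            return "no exception";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(useAfterClose());
    }
}
```

If that reading is right, the question becomes what stops the client between the leader election at 14:57:07 and the activation attempt at 14:57:28.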



--
This message was sent by Atlassian Jira
(v8.20.10#820010)