[
https://issues.apache.org/jira/browse/TWILL-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598628#comment-14598628
]
Terence Yim commented on TWILL-139:
-----------------------------------
Found the root cause: a race condition in the org.I0Itec.zkclient.ZKClient
used by Kafka.
{{ZKClient.waitUntilConnected()}} waits for a signal fired from the ZK event
thread and then checks whether the current state is SyncConnected. However, the
"SyncConnected" event can be followed immediately by the "SaslAuthenticated"
event (in the ZK event thread), which changes the current state to
"SaslAuthenticated" before the "waitUntilConnected" thread has a chance to
acquire the lock and compare against the current state.
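
The race described above can be illustrated with a small, self-contained model.
This is only a sketch, not the zkclient source: the class
ConnectionStateRace, its methods, and the "sticky flag" variant are all
hypothetical names made up for illustration. Only the state names come from the
comment above. Both events are delivered before the waiter looks at the state,
which stands in for the waiter losing the lock race.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the race; not the actual zkclient implementation.
public class ConnectionStateRace {
    enum State { Disconnected, SyncConnected, SaslAuthenticated }

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition stateChanged = lock.newCondition();
    private State current = State.Disconnected;
    private boolean everConnected = false;

    // Stands in for the ZK event thread: record the new state and signal waiters.
    void onEvent(State s) {
        lock.lock();
        try {
            current = s;
            if (s == State.SyncConnected) {
                everConnected = true;
            }
            stateChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Fragile wait: succeeds only if the state is exactly SyncConnected
    // at the moment the waiter holds the lock and compares.
    boolean waitForExactState(long timeoutMs) throws InterruptedException {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (current != State.SyncConnected) {
                if (nanos <= 0) {
                    return false; // timed out; without a timeout this would hang
                }
                nanos = stateChanged.awaitNanos(nanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    // One possible alternative (an assumption, not the fix applied in Twill):
    // remember that SyncConnected was seen, even if a later event replaced it.
    boolean waitForConnectedOnce(long timeoutMs) throws InterruptedException {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        lock.lock();
        try {
            while (!everConnected) {
                if (nanos <= 0) {
                    return false;
                }
                nanos = stateChanged.awaitNanos(nanos);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectionStateRace race = new ConnectionStateRace();
        // Both events arrive back to back before the waiter checks the state.
        race.onEvent(State.SyncConnected);
        race.onEvent(State.SaslAuthenticated);

        System.out.println("exact-state wait succeeded: " + race.waitForExactState(200));
        System.out.println("sticky-flag wait succeeded: " + race.waitForConnectedOnce(200));
    }
}
```

In this model the exact-state wait times out because the state it compares
against has already moved on to SaslAuthenticated, while the sticky-flag wait
returns at once. A wait with no timeout would block indefinitely in the first
case, which matches the hang reported in this issue.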
> ApplicationMaster hangs during start when ZooKeeper SASL authentication is
> turned on
> ------------------------------------------------------------------------------------
>
> Key: TWILL-139
> URL: https://issues.apache.org/jira/browse/TWILL-139
> Project: Apache Twill
> Issue Type: Bug
> Components: core, yarn
> Affects Versions: 0.5.0-incubating, 0.4.1-incubating
> Reporter: Terence Yim
> Assignee: Terence Yim
> Priority: Blocker
> Fix For: 0.6.0-incubating
>
>
> It is caused by a race condition when one {{ZKClient}} instance is performing
> the authentication while the {{EmbeddedKafkaServer}} is trying to start and
> connect to ZooKeeper.
> Here is the main method to reproduce the issue:
> {noformat}
> public static void main(String[] args) throws Exception {
>   String zkStr = args[0];
>   ZKClientService zkClient = ZKClientService.Builder.of(zkStr).build();
>   EmbeddedKafkaServer kafka = new EmbeddedKafkaServer(generateKafkaConfig(zkStr));
>   zkClient.startAndWait(); // <-- This returns when SyncConnected
>   kafka.startAndWait();    // <-- This call hangs and never returns
> }
> {noformat}