[ https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213674#comment-17213674 ]
maoling commented on ZOOKEEPER-3940:
------------------------------------

[~stanhend] Yes, good finding. These JNI transports add features specific to a particular platform.

_*---> What is the process for getting the netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar file?*_

Add _*<classifier>linux-x86_64</classifier>*_ to the _*pom.xml*_ and rebuild the source code. The jar (netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar) will then appear under zookeeper-assembly/target/apache-zookeeper-3.7.0-SNAPSHOT-bin/lib

A potential improvement might look like this:
{code:java}
<build>
  <extensions>
    <extension>
      <groupId>kr.motd.maven</groupId>
      <artifactId>os-maven-plugin</artifactId>
      <version>1.5.0.Final</version>
    </extension>
  </extensions>
  ...
</build>

<dependencies>
  <dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-transport-native-epoll</artifactId>
    <version>${project.version}</version>
    <classifier>${os.detected.name}-${os.detected.arch}</classifier>
  </dependency>
  ...
</dependencies>{code}
Of course, at present we cannot tell whether the failure to form a quorum is caused by this Netty-related issue.
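
One way to confirm whether the rebuilt lib directory actually provides the native transport is to probe Netty's Epoll class at runtime. The following is only a minimal sketch, not part of ZooKeeper: the class name EpollCheck, the Status enum and the probe helper are hypothetical names chosen for illustration. It uses reflection so that it also runs when Netty is absent from the classpath, and it relies only on the static io.netty.channel.epoll.Epoll.isAvailable() method.

```java
import java.lang.reflect.Method;

public class EpollCheck {
    /** Possible outcomes of the probe (hypothetical names, for illustration only). */
    enum Status { NOT_ON_CLASSPATH, AVAILABLE, UNAVAILABLE }

    /**
     * Loads the named class reflectively and calls its static, no-argument
     * isAvailable() method. Any failure other than a missing class is
     * reported as UNAVAILABLE.
     */
    static Status probe(String className) {
        try {
            Class<?> cls = Class.forName(className);
            Method isAvailable = cls.getMethod("isAvailable");
            boolean ok = (Boolean) isAvailable.invoke(null);
            return ok ? Status.AVAILABLE : Status.UNAVAILABLE;
        } catch (ClassNotFoundException e) {
            // The jar containing the class is not on the classpath at all.
            return Status.NOT_ON_CLASSPATH;
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class found, but the method is missing or could not be invoked.
            return Status.UNAVAILABLE;
        }
    }

    public static void main(String[] args) {
        System.out.println("native epoll: " + probe("io.netty.channel.epoll.Epoll"));
    }
}
```

Running it with the ZooKeeper lib directory on the classpath (for example java -cp "lib/*:." EpollCheck, assuming the class was compiled into the current directory) should show whether the platform-specific jar made a difference. If it reports UNAVAILABLE, Epoll.unavailabilityCause() in Netty returns the underlying reason.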

> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
> Key: ZOOKEEPER-3940
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.2
> Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
> Reporter: Stan Henderson
> Priority: Critical
> Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip,
> zk-docker-containers.log.zip, zoo.cfg, zoo1-docker-containers.log,
> zoo1-docker-containers.log, zoo2-docker-containers.log,
> zoo3-docker-containers.log
>
>
> We have configured a 3 node zookeeper cluster using the 3.6.2 version in a
> Docker version 1.12.1 containerized environment. This corresponds to Sep 16
> 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for 3.6
> branch
> As a part of our testing, we have restarted each of the zookeeper nodes and
> have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached
> docker-containers.log files.
> 1. (simulate patching zoo2)
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently
> serving requests)
> - waited 5 minutes to see if they resolve Sep 17 14:39 - Sep 17 14:44
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still
> unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no
> longer able to start any of the zk nodes.
>
> [~maoling] this issue may relate to
> https://issues.apache.org/jira/browse/ZOOKEEPER-3920 which corresponds to the
> first and second cases above that I am working with [~blb93] on.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)