[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum

Mate Szalay-Beko (Jira) Wed, 18 Mar 2020 01:03:17 -0700


    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061481#comment-17061481
 ]


Mate Szalay-Beko commented on ZOOKEEPER-3756:
---------------------------------------------

I am not familiar enough with Kubernetes to understand all the YAML files you 
sent :)

So I started with the plain Docker part. I was able to build your Dockerfile 
(slightly modified: I removed the GPG steps and added the tar.gz file 
directly), and then I was able to start a standalone ZooKeeper server 
container with: 
{code:java}
docker build -t zookeeper-3756 .
docker run --rm zookeeper-3756:latest
{code}
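
A quick way to confirm that the container really came up in standalone mode is to send it the {{srvr}} four-letter command and read the {{Mode:}} line of the reply. The following is a minimal sketch, not part of the original setup. It assumes the client port is reachable from where the script runs (for example because {{-p 2181:2181}} was added to the {{docker run}} command above), and the helper names {{four_letter_word}} and {{parse_mode}} are made up for illustration.

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply.

    Assumption: the server closes the connection after replying, so we
    simply read until end of stream.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line, or None if there is none."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

With the port published, {{parse_mode(four_letter_word("localhost", 2181))}} should return {{standalone}} for this container, and {{leader}} or {{follower}} for a member of a working quorum.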
 

The modified Dockerfile:
{code:java}
FROM ubuntu:16.04

# install jre
RUN apt-get update -y && \
    apt-get upgrade -y && \
    apt-get install -y default-jre gosu netcat-openbsd wget

ARG DISTRO_NAME=zookeeper-3.5.8-SNAPSHOT
ARG ARCHIVE_NAME=apache-$DISTRO_NAME-bin

ENV ZOO_USER=zookeeper \
    ZOO_CONF_DIR=/conf \
    ZOO_DATA_DIR=/data \
    ZOO_DATA_LOG_DIR=/datalog \
    ZOO_PORT=2181 \
    ZOO_TICK_TIME=2000 \
    ZOO_INIT_LIMIT=5 \
    ZOO_SYNC_LIMIT=2 \
    ZOO_AUTOPURGE_RETAIN_COUNT=50 \
    ZOO_AUTOPURGE_INTERVAL=6 \
    ZOO_LOG_DIR=/logs \
    JMX_CONF_DIR=/etc/jmx

COPY  apache-zookeeper-3.5.8-SNAPSHOT-bin.tar.gz /

# Add a user and make dirs
RUN set -x \
    && useradd "$ZOO_USER" \
    && mkdir -p "$ZOO_DATA_LOG_DIR" "$ZOO_DATA_DIR" "$ZOO_CONF_DIR" "$ZOO_LOG_DIR" "$JMX_CONF_DIR" \
    && chown "$ZOO_USER:$ZOO_USER" "$ZOO_DATA_LOG_DIR" "$ZOO_DATA_DIR" "$ZOO_CONF_DIR" "$ZOO_LOG_DIR"

# Unpack the copied ZooKeeper archive, move its config files, clean up,
# and fetch the JMX Prometheus agent
RUN set -x && \
    cd / && \
    tar -xzf "$ARCHIVE_NAME.tar.gz" && \
    mv "$ARCHIVE_NAME/conf/"* "$ZOO_CONF_DIR" && \
    rm "$ARCHIVE_NAME.tar.gz" && \
    cd /$ARCHIVE_NAME && \
    wget -q "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.11.0/jmx_prometheus_javaagent-0.11.0.jar"


WORKDIR $ARCHIVE_NAME
VOLUME ["$ZOO_DATA_DIR", "$ZOO_DATA_LOG_DIR"]

EXPOSE $ZOO_PORT 2888 3888

ENV PATH=$PATH:/$ARCHIVE_NAME/bin \
    ZOOCFGDIR=$ZOO_CONF_DIR

COPY docker-entrypoint.sh /
COPY jmx.yaml /etc/jmx/config.yaml
ENTRYPOINT ["/docker-entrypoint.sh"]
CMD ["zkServer.sh", "start-foreground"]

{code}
 

The fact that the standalone ZooKeeper server started is a good sign :)
However, I am not sure why you saw the {{"Could not find or load main class"}} 
error.

I will try to make a minimal Kubernetes setup where I can reproduce the problem 
with the connection timeout, using the original 3.5.7 version.

> Members failing to rejoin quorum
> --------------------------------
>
>                 Key: ZOOKEEPER-3756
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: leaderElection
>    Affects Versions: 3.5.6, 3.5.7
>            Reporter: Dai Shi
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message 
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
>         at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting 
> SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the 
> leader.
> However, if I remove server 4 and 5 from the configuration of server 1 or 2 
> (so only servers 1, 2, and 3 remain in the configuration file), then they can 
> rejoin the quorum fine. Is this expected and am I doing something wrong? Any 
> help or explanation would be greatly appreciated. Thank you.
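
The repeated "Have smaller server identifier, so dropping the connection: (3, 2)" lines in the quoted log show the tie-break used for election connections: a server that opens a connection to a peer with a larger id drops it and relies on that peer to connect back. The sketch below is a simplified model of that rule only, written to make the log easier to read. It is not ZooKeeper code, the function names are hypothetical, and it does not by itself explain why the rejoin fails.

```python
def initiator_keeps_connection(my_id, remote_id):
    """Simplified model of the tie-break seen in the log.

    A connection opened by my_id towards remote_id is kept only when
    my_id is the larger of the two ids; otherwise it is dropped and the
    remote side is expected to connect back.
    """
    return my_id > remote_id


def peers_that_must_dial_back(my_id, ensemble_ids):
    """List the peers a server cannot reach on its own under this rule."""
    return [
        sid
        for sid in ensemble_ids
        if sid != my_id and not initiator_keeps_connection(my_id, sid)
    ]
```

For the five-member ensemble in the report, {{peers_that_must_dial_back(2, [1, 2, 3, 4, 5])}} gives {{[3, 4, 5]}}, which matches the three dropped connections logged by server 2. Under this model a restarted server 1 or 2 depends on the higher-numbered servers, including the leader, being able to connect back to its election port 3888.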



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
