[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185454#comment-13185454
 ] 

Henry Robinson commented on ZOOKEEPER-1294:
-------------------------------------------

So, after some investigation, I've found out what's happening with 
testNoLogBeforeLeaderEstablishment.

The patch changes the locking in Leader.java; now the lock around the 
sync-and-ping loop is on the forwardingFollowers set. The call to ping() with 
that lock held then takes the lock on the leader object. 

In the failing test runs, at the same time the ProposalRequestProcessor has 
locked the leader object in order to make a proposal in Leader.propose(). This 
then calls sendPacket, which (tries to) lock on forwardingFollowers. 
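
To make the two lock orders concrete, here is a minimal, self-contained sketch. It is not the ZooKeeper code: `leaderLock` and `followersLock` are hypothetical stand-ins for the leader monitor and the forwardingFollowers monitor, and it uses `ReentrantLock.tryLock` with a timeout instead of `synchronized` so that the demo reports the conflict rather than hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    // Hypothetical stand-ins for the two monitors involved (assumption, not real fields).
    static final ReentrantLock leaderLock = new ReentrantLock();
    static final ReentrantLock followersLock = new ReentrantLock();

    // Hold `first`, wait until the peer thread holds its own first lock,
    // then try `second` with a timeout. Returns true if `second` was acquired.
    static boolean acquireInOrder(ReentrantLock first, ReentrantLock second,
                                  CountDownLatch bothHoldFirst,
                                  CountDownLatch bothTried) throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // With plain synchronized blocks this attempt would block forever.
            boolean gotSecond = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (gotSecond) {
                second.unlock();
            }
            // Keep holding `first` until the peer has finished its own attempt.
            bothTried.countDown();
            bothTried.await();
            return gotSecond;
        } finally {
            first.unlock();
        }
    }

    private static void attempt(ReentrantLock first, ReentrantLock second,
                                CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                AtomicInteger blocked) {
        try {
            if (!acquireInOrder(first, second, bothHoldFirst, bothTried)) {
                blocked.incrementAndGet();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Runs both lock orders concurrently; returns how many threads
    // could not obtain their second lock.
    static int countBlockedThreads() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        AtomicInteger blocked = new AtomicInteger();

        // Order 1 (sync-and-ping loop): followers set first, then leader.
        Thread pinger = new Thread(() ->
                attempt(followersLock, leaderLock, bothHoldFirst, bothTried, blocked));
        // Order 2 (propose -> sendPacket): leader first, then followers set.
        Thread proposer = new Thread(() ->
                attempt(leaderLock, followersLock, bothHoldFirst, bothTried, blocked));

        pinger.start();
        proposer.start();
        try {
            pinger.join();
            proposer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for demo threads", e);
        }
        return blocked.get();
    }

    public static void main(String[] args) {
        System.out.println("threads blocked on their second lock: " + countBlockedThreads());
    }
}
```

Because each thread keeps its first lock until both have tried for their second, neither attempt can succeed, which is the situation the failing test runs hit with real monitors.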

This is a classic deadlock: the two threads try to take the same two locks in 
opposite orders. There are a few ways to fix it, but I think the patch 
*shouldn't* be changing the set to forwardingFollowers at all, and should keep 
using learners as before. This is because observers should be pinged as well, 
I think, so that they don't think they're dead. Instead, the code should 
explicitly test whether a learner is a PARTICIPANT, as below:

{code}
synchronized (learners) {
    for (LearnerHandler f : learners) {
        if (f.synced() && f.getLearnerType() == LearnerType.PARTICIPANT) {
            syncedCount++;
            syncedSet.add(f.getSid());
        }
        f.ping();
    }
}
{code}

So only participants get added to the sync set, but everyone gets pinged. This 
seems to fix the problem with this test, at least for me. Any thoughts?
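
For anyone who wants to check that behaviour in isolation, here is a small runnable sketch of the same loop shape. `FakeLearner` and `pingAllAndCollectSynced` are hypothetical names invented for illustration, not ZooKeeper classes; the sketch only assumes that each learner has a server id, a type, and a synced flag.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PingLoopSketch {
    enum LearnerType { PARTICIPANT, OBSERVER }

    // Hypothetical stand-in for a learner handler; it only records pings.
    static final class FakeLearner {
        final long sid;
        final LearnerType type;
        final boolean synced;
        int pings = 0;

        FakeLearner(long sid, LearnerType type, boolean synced) {
            this.sid = sid;
            this.type = type;
            this.synced = synced;
        }

        void ping() {
            pings++;
        }
    }

    // Pings every learner, but returns the ids of synced participants only.
    static Set<Long> pingAllAndCollectSynced(List<FakeLearner> learners) {
        Set<Long> syncedSet = new HashSet<>();
        synchronized (learners) {
            for (FakeLearner f : learners) {
                if (f.synced && f.type == LearnerType.PARTICIPANT) {
                    syncedSet.add(f.sid);
                }
                f.ping();
            }
        }
        return syncedSet;
    }

    public static void main(String[] args) {
        List<FakeLearner> learners = new ArrayList<>();
        learners.add(new FakeLearner(1L, LearnerType.PARTICIPANT, false));
        learners.add(new FakeLearner(2L, LearnerType.PARTICIPANT, true));
        learners.add(new FakeLearner(4L, LearnerType.OBSERVER, true));

        Set<Long> synced = pingAllAndCollectSynced(learners);
        System.out.println("synced participants: " + synced);
        for (FakeLearner f : learners) {
            System.out.println("sid " + f.sid + " pings: " + f.pings);
        }
    }
}
```

In this sketch the synced observer is pinged but never counted, so it cannot contribute to the synced set that the leader checks against the quorum.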
                
> One of the zookeeper server is not accepting any requests
> ---------------------------------------------------------
>
>                 Key: ZOOKEEPER-1294
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1294
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>         Environment: 3 Zookeeper + 3 Observer with SuSe-11
>            Reporter: amith
>            Assignee: kavita sharma
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1294-1.patch, ZOOKEEPER-1294.patch
>
>
> In zoo.cfg I have configured the servers as follows:
> server.1 = XX.XX.XX.XX:65175:65173
> server.2 = XX.XX.XX.XX:65185:65183
> server.3 = XX.XX.XX.XX:65195:65193
> server.4 = XX.XX.XX.XX:65205:65203:observer
> server.5 = XX.XX.XX.XX:65215:65213:observer
> server.6 = XX.XX.XX.XX:65225:65223:observer
> As shown above, I have configured 3 PARTICIPANTS and 3 OBSERVERS
> in a cluster of 6 ZooKeeper servers.
> Steps to reproduce the defect:
> 1. Start all 3 participant ZooKeeper servers
> 2. Stop all the participant ZooKeeper servers
> 3. Start ZooKeeper 1 (Participant)
> 4. Start ZooKeeper 2 (Participant)
> 5. Start ZooKeeper 4 (Observer)
> 6. Create a persistent node with an external client and close the client
> 7. Stop ZooKeeper 1 (Participant; now the quorum is unstable)
> 8. Create a new client and try to find the node created before using the 
> exists API (will fail since the quorum is not satisfied)
> 9. Start ZooKeeper 1 (Participant; stabilises the quorum)
> Now check the observer using the 4-letter word (Server.4):
> linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat | netcat localhost 65200
> Zookeeper version: 3.3.2-1031432, built on 11/05/2010 05:32 GMT
> Clients:
>  /127.0.0.1:46370[0](queued=0,recved=1,sent=0)
> Latency min/avg/max: 0/0/0
> Received: 1
> Sent: 0
> Outstanding: 0
> Zxid: 0x100000003
> Mode: observer
> Node count: 5
> Check participant 2 with the 4-letter word:
> Latency min/avg/max: 22/48/83
> Received: 39
> Sent: 3
> Outstanding: 35
> Zxid: 0x100000003
> Mode: leader
> Node count: 5
> linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin #
> Check participant 1 with the 4-letter word:
> linux-216:/home/amith/CI/source/install/zookeeper/zookeeper2/bin # echo stat | netcat localhost 65170
> This ZooKeeper instance is not currently serving requests
> The participant 1 logs are filled with:
> 2011-11-08 15:49:51,360 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:65170:NIOServerCnxn@642] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> The problem here is that participant 1 is not responding to or accepting any 
> requests.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
