[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850507#comment-13850507
 ] 

Germán Blanco commented on ZOOKEEPER-1841:
------------------------------------------

Yes, sorry about that, it was not a clear explanation.
The test runs with only 3 servers out of a 5-server cluster. The other 2
servers are simply not there. On every FLE timeout, the 3 running servers try
to connect to the 2 missing ones, and each attempt may take up to 5 seconds,
which is the connection timeout. Since the CPU is very loaded (this is a
guess) and TCP REJECT messages don't seem to arrive very quickly, the 3
running servers spend a lot of time waiting for those connections. Since
connections are attempted in the single FLE thread (no multithreading there
yet), each server can be blocked for up to 5 seconds. During this time it
doesn't respond to the other running servers, which may trigger additional
disconnects. This is what I mean by chaos. The result is that leader election
never completes. Setting System.setProperty("zookeeper.cnxTimeout", "50");
means that the running servers quickly give up on the servers that are not
running, are able to process requests from the other running servers, and
leader election completes.
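
To put rough numbers on this, here is a minimal sketch. It assumes the connection timeout is read from the zookeeper.cnxTimeout system property with a 5000 ms default, and that the connection attempts to the absent servers happen one after another in a single thread. The class and method names are made up for illustration and are not ZooKeeper code.

```java
// Illustrative sketch only: CnxTimeoutSketch and its methods are hypothetical.
class CnxTimeoutSketch {
    // Property name as used in the comment above.
    static final String CNX_TIMEOUT_PROPERTY = "zookeeper.cnxTimeout";
    // Assumed default: the 5 second connection timeout mentioned above.
    static final int DEFAULT_CNX_TIMEOUT_MS = 5000;

    /** Timeout in ms taken from the system property, or the assumed default. */
    static int effectiveTimeoutMs() {
        return Integer.getInteger(CNX_TIMEOUT_PROPERTY, DEFAULT_CNX_TIMEOUT_MS);
    }

    /**
     * Upper bound on how long one server is blocked per round, assuming it
     * tries each absent server in turn and every attempt runs into the timeout.
     */
    static long worstCaseBlockedMs(int absentServers, int timeoutMs) {
        return (long) absentServers * timeoutMs;
    }

    public static void main(String[] args) {
        int absentServers = 2; // 5-server cluster with only 3 servers running

        System.clearProperty(CNX_TIMEOUT_PROPERTY);
        System.out.println("default: blocked up to "
                + worstCaseBlockedMs(absentServers, effectiveTimeoutMs()) + " ms");

        // The setting used in the test.
        System.setProperty(CNX_TIMEOUT_PROPERTY, "50");
        System.out.println("with 50: blocked up to "
                + worstCaseBlockedMs(absentServers, effectiveTimeoutMs()) + " ms");
    }
}
```

The property has to be set before the servers are started, otherwise the running servers may already have picked up the default.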

The loop that I was referring to is this one:
{noformat}
        while(qu.getPeer(index).peer.leader == null) {
            index++;
        }
{noformat}
I added the Thread.sleep there because I thought it was running for a long time
and eating CPU, but that thought was wrong. It only checks indices 1 to 3, and
if none of those servers has a leader, it throws a NullPointerException.
I hope that clarifies things.
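
For illustration, a bounded version of that kind of search could look like the sketch below. This is not what the patch does and it does not use the real test classes: Peer is a hypothetical stand-in that only carries the leader field the loop reads, and the timeout and poll interval are arbitrary. The idea is just that the search stops at the last peer and reports "no leader yet" instead of running off the end.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: Peer is a hypothetical stand-in, not the real QuorumPeer.
class LeaderSearchSketch {
    static final class Peer {
        volatile Object leader; // stays null unless this peer is the leader
        Peer(Object leader) { this.leader = leader; }
    }

    /** Returns the 1-based index of the first peer with a leader set, or -1 if none. */
    static int findLeaderIndex(List<Peer> peers) {
        for (int i = 0; i < peers.size(); i++) {
            if (peers.get(i).leader != null) {
                return i + 1;
            }
        }
        return -1;
    }

    /** Polls until a leader shows up or timeoutMs has passed; returns -1 on timeout. */
    static int waitForLeader(List<Peer> peers, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            int index = findLeaderIndex(peers);
            if (index != -1 || System.nanoTime() - deadline >= 0) {
                return index;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag set
                return -1;
            }
        }
    }

    public static void main(String[] args) {
        List<Peer> peers = Arrays.asList(new Peer(null), new Peer(new Object()), new Peer(null));
        System.out.println("leader index: " + waitForLeader(peers, 1000, 10));
    }
}
```

A test using this could then fail with a clear assertion message when the result is -1, rather than with a NullPointerException.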

> problem in QuorumTest
> ---------------------
>
>                 Key: ZOOKEEPER-1841
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1841
>             Project: ZooKeeper
>          Issue Type: Sub-task
>          Components: tests
>    Affects Versions: 3.4.5
>         Environment: Windows, Java 1.7
>            Reporter: Germán Blanco
>            Assignee: Germán Blanco
>             Fix For: 3.4.6
>
>         Attachments: ZOOKEEPER-1841-branch3.4.patch, 
> ZOOKEEPER-1841-branch3.4.patch, ZOOKEEPER-1841-branch3.4.patch, 
> ZOOKEEPER-1841-branch3.4.patch, ZOOKEEPER-1841-branch3.4.patch
>
>
> QuorumTest.testNoLogBeforeLeaderEstablishment fails with Assertion: "NOt 
> following"



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
