[jira] [Commented] (ZOOKEEPER-2464) NullPointerException on ContainerManager

2017-02-01 Thread Jens Rantil (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848218#comment-15848218
 ] 

Jens Rantil commented on ZOOKEEPER-2464:


Hi, we are also seeing this. We have a lot of znodes building up in production 
as we speak (currently at 3,127,223). We have a temporary script that removes 
older znodes (sketched below), but this is a big operational risk waiting to 
explode, since failover will be very heavy.

What's the next step here? Can we help in any way? Is there a date for when a 
release with a fix for this will be available?
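
For context, a simplified sketch of the kind of cleanup we run is below; the 
connection string, the /locks parent path and the 24-hour cutoff are 
placeholders rather than our real configuration. It deletes childless znodes 
whose mtime is older than the cutoff:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.data.Stat;

public class StaleZnodeCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and parent path, not our real setup.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000; // 24 hours
            for (String child : client.getChildren().forPath("/locks")) {
                String path = "/locks/" + child;
                Stat stat = client.checkExists().forPath(path);
                // Only remove znodes that are old and have no children (no lock holders).
                if (stat != null && stat.getNumChildren() == 0 && stat.getMtime() < cutoff) {
                    try {
                        client.delete().forPath(path);
                    } catch (Exception e) {
                        // The node may have gained children or been removed concurrently; skip it.
                    }
                }
            }
        } finally {
            client.close();
        }
    }
}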

> NullPointerException on ContainerManager
> 
>
> Key: ZOOKEEPER-2464
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2464
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.1
>Reporter: Stefano Salmaso
>Assignee: Jordan Zimmerman
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ContainerManagerTest.java, ZOOKEEPER-2464.patch
>
>
> I would like to describe a problem that we are experiencing.
> We are using a cluster of 7 ZooKeeper servers and we use them to implement a 
> distributed lock using Curator 
> (http://curator.apache.org/curator-recipes/shared-reentrant-lock.html).
> So we tried to play with the servers to see if everything worked properly: we 
> stopped and started servers to check that the system kept working
> (for example: stop 03, stop 05, stop 06, start 05, start 06, start 03).
> We saw a strange behavior.
> The number of znodes kept growing without stopping (normally we have 4,000 or 
> 5,000; we got to 60,000, and then we stopped our application).
> In the ZooKeeper logs I saw this (on the leader only, once every minute):
> 2016-07-04 14:53:50,302 [myid:7] - ERROR 
> [ContainerManagerTask:ContainerManager$1@84] - Error checking containers
> java.lang.NullPointerException
>at 
> org.apache.zookeeper.server.ContainerManager.getCandidates(ContainerManager.java:151)
>at 
> org.apache.zookeeper.server.ContainerManager.checkContainers(ContainerManager.java:111)
>at 
> org.apache.zookeeper.server.ContainerManager$1.run(ContainerManager.java:78)
>at java.util.TimerThread.mainLoop(Timer.java:555)
>at java.util.TimerThread.run(Timer.java:505)
> We have not yet deleted the data ... so the problem can be reproduced on our 
> servers
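
The stack trace above points at ContainerManager.getCandidates(). One way an 
NPE like that can arise is a check-then-act race: a list of container paths is 
snapshotted, each node is then looked up again, and a node deleted in between 
comes back as null. The sketch below is a generic Java illustration of that 
race and of the defensive null check that avoids it; it is not the actual 
ZooKeeper ContainerManager code.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Generic illustration of a list-then-lookup race; NOT ZooKeeper's ContainerManager.
public class ContainerScanSketch {
    static final class Node {
        final int childCount;
        Node(int childCount) { this.childCount = childCount; }
    }

    private final Map<String, Node> nodes = new ConcurrentHashMap<>();

    // Collect the paths of containers that currently have no children.
    Set<String> getCandidates(Set<String> containerPathSnapshot) {
        Set<String> candidates = new HashSet<>();
        for (String path : containerPathSnapshot) {
            Node node = nodes.get(path);
            // Without the null check, a container deleted after the snapshot was
            // taken but before this lookup would cause a NullPointerException here.
            if (node != null && node.childCount == 0) {
                candidates.add(path);
            }
        }
        return candidates;
    }
}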



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-1869) zk server falling apart from quorum due to connection loss and couldn't connect back

2015-09-09 Thread Jens Rantil (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737084#comment-14737084
 ] 

Jens Rantil commented on ZOOKEEPER-1869:


Thank you for your reply, Flavio. FYI, we've now upgraded to 3.4. I'll post 
here if we can recreate the issue.

> zk server falling apart from quorum due to connection loss and couldn't 
> connect back
> 
>
> Key: ZOOKEEPER-1869
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1869
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.0
> Environment: Using CentOS6 for running these zookeeper servers
>Reporter: Deepak Jagtap
>Priority: Critical
>
> We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the 
> quorum.
> The problem we are facing is that one zookeeper server in the quorum falls 
> apart and never becomes part of the cluster again until we restart the 
> zookeeper server on that node.
> Our interpretation of the zookeeper logs on all nodes is as follows:
> (For simplicity assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3)
> Initially S3 is the leader while S1 and S2 are followers.
> S2 hits a 46-second latency while fsyncing the write-ahead log, which results 
> in loss of connection with S3.
> S3 in turn prints the following error message:
> Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> Stack trace
> *** GOODBYE /169.254.1.2:47647(S2) 
> S2 in this case closes the connection with S3 (the leader) and shuts down the 
> follower with the following log messages:
> Closing connection to leader, exception during packet send
> java.net.SocketException: Socket close
> Follower@194] - shutdown called
> java.lang.Exception: shutdown Follower
> After this point S3 could never reestablish a connection with S2, and the 
> leader election mechanism keeps failing. S3 now keeps printing the following 
> message repeatedly:
> Cannot open channel to 2 at election address /169.254.1.2:3888
> java.net.ConnectException: Connection refused.
> While S3 is in this state, S2 repeatedly keeps printing the following messages:
> INFO 
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /127.0.0.1:60667
> Exception causing close of session 0x0: ZooKeeperServer not running
> Closed socket connection for client /127.0.0.1:60667 (no session established 
> for client)
> Leader election never completes successfully, causing S2 to fall apart from 
> the quorum.
> S2 was out of the quorum for almost 1 week.
> While debugging this issue, we found out that neither the election port nor 
> the peer connection port on S2 could be reached via telnet from any of the 
> nodes (S1, S2, S3). Network connectivity is not the issue. Later, we restarted 
> the ZK server on S2 (service zookeeper-server restart) -- after that we could 
> telnet to both ports, and S2 joined the ensemble after a leader election 
> attempt.
> Any idea what might be forcing S2 into a situation where it won't accept any 
> connections on the leader election and peer connection ports?
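
Regarding the telnet checks described above, an equivalent programmatic probe 
of the peer and election ports (2888 and 3888 here, which are only the 
conventional defaults and may differ per deployment) could look like the 
following; it tests plain TCP reachability and nothing ZooKeeper-specific:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    // Attempt a TCP connect with a timeout; true means the port accepted the connection.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "169.254.1.2";  // S2 in the report above
        int[] ports = {2888, 3888};   // peer and election ports (defaults assumed)
        for (int port : ports) {
            System.out.printf("%s:%d reachable=%b%n", host, port, canConnect(host, port, 2000));
        }
    }
}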



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1869) zk server falling apart from quorum due to connection loss and couldn't connect back

2015-07-31 Thread Jens Rantil (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648894#comment-14648894
 ] 

Jens Rantil commented on ZOOKEEPER-1869:


Hi. We, too, probably hit this issue yesterday. Suddenly, our entire ZooKeeper 
ensemble was unresponsive, and the logs tell us a very similar story to the one 
described by Deepak. Restarting all servers did not bring the ensemble back. We 
had to stop all nodes, clean the data directory (we only use znodes for locks) 
and start them all again. Since our logs contain some keys that could be a 
little sensitive, I'm willing to e-mail logs on request.

We are running version 3.3.5.

While I don't have any logs, I suspect this happened to us once a couple of 
weeks back, too.
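
For context on the znodes-for-locks usage mentioned above: the Curator shared 
reentrant lock recipe (InterProcessMutex) is typically used roughly as follows; 
the connection string and lock path are placeholders, not our actual setup:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.util.concurrent.TimeUnit;

public class LockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // Each acquire creates an ephemeral sequential znode under the lock path.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // ... critical section ...
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}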

> zk server falling apart from quorum due to connection loss and couldn't 
> connect back
> 
>
> Key: ZOOKEEPER-1869
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1869
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.0
> Environment: Using CentOS6 for running these zookeeper servers
>Reporter: Deepak Jagtap
>Priority: Critical
>
> We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the 
> quorum.
> The problem we are facing is that one zookeeper server in the quorum falls 
> apart and never becomes part of the cluster again until we restart the 
> zookeeper server on that node.
> Our interpretation of the zookeeper logs on all nodes is as follows:
> (For simplicity assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3)
> Initially S3 is the leader while S1 and S2 are followers.
> S2 hits a 46-second latency while fsyncing the write-ahead log, which results 
> in loss of connection with S3.
> S3 in turn prints the following error message:
> Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> Stack trace
> *** GOODBYE /169.254.1.2:47647(S2) 
> S2 in this case closes the connection with S3 (the leader) and shuts down the 
> follower with the following log messages:
> Closing connection to leader, exception during packet send
> java.net.SocketException: Socket close
> Follower@194] - shutdown called
> java.lang.Exception: shutdown Follower
> After this point S3 could never reestablish a connection with S2, and the 
> leader election mechanism keeps failing. S3 now keeps printing the following 
> message repeatedly:
> Cannot open channel to 2 at election address /169.254.1.2:3888
> java.net.ConnectException: Connection refused.
> While S3 is in this state, S2 repeatedly keeps printing the following messages:
> INFO 
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /127.0.0.1:60667
> Exception causing close of session 0x0: ZooKeeperServer not running
> Closed socket connection for client /127.0.0.1:60667 (no session established 
> for client)
> Leader election never completes successfully, causing S2 to fall apart from 
> the quorum.
> S2 was out of the quorum for almost 1 week.
> While debugging this issue, we found out that neither the election port nor 
> the peer connection port on S2 could be reached via telnet from any of the 
> nodes (S1, S2, S3). Network connectivity is not the issue. Later, we restarted 
> the ZK server on S2 (service zookeeper-server restart) -- after that we could 
> telnet to both ports, and S2 joined the ensemble after a leader election 
> attempt.
> Any idea what might be forcing S2 into a situation where it won't accept any 
> connections on the leader election and peer connection ports?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)