[jira] [Commented] (ZOOKEEPER-2464) NullPointerException on ContainerManager

2017-02-01 Thread Jens Rantil (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848218#comment-15848218
 ] 

Jens Rantil commented on ZOOKEEPER-2464:


Hi, we are also seeing this. We have a lot of znodes building up in production 
as we speak (currently at 3,127,223). We have a temporary script that removes 
older znodes (sketched below), but this is a big operational risk waiting to 
explode, since failover will be very heavy.

What's the next step here? Can we help in any way? Is there a date for when a 
release with a fix for this will be available?
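
For context, a simplified sketch of the kind of cleanup we run is below; the 
connection string, the /locks parent path and the 24-hour cutoff are 
placeholders rather than our real configuration. It deletes childless znodes 
whose mtime is older than the cutoff:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.data.Stat;

public class StaleZnodeCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and parent path, not our real setup.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000; // 24 hours
            for (String child : client.getChildren().forPath("/locks")) {
                String path = "/locks/" + child;
                Stat stat = client.checkExists().forPath(path);
                // Only remove znodes that are old and have no children (no lock holders).
                if (stat != null && stat.getNumChildren() == 0 && stat.getMtime() < cutoff) {
                    try {
                        client.delete().forPath(path);
                    } catch (Exception e) {
                        // The node may have gained children or been removed concurrently; skip it.
                    }
                }
            }
        } finally {
            client.close();
        }
    }
}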

> NullPointerException on ContainerManager
> 
>
> Key: ZOOKEEPER-2464
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2464
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.1
>Reporter: Stefano Salmaso
>Assignee: Jordan Zimmerman
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ContainerManagerTest.java, ZOOKEEPER-2464.patch
>
>
> I would like to describe a problem that we are experiencing.
> We are using a cluster of 7 ZooKeeper servers and we use them to implement a 
> distributed lock using Curator 
> (http://curator.apache.org/curator-recipes/shared-reentrant-lock.html).
> So we tried to play with the servers to see if everything worked properly: we 
> stopped and started servers to check that the system kept working
> (for example: stop 03, stop 05, stop 06, start 05, start 06, start 03).
> We saw a strange behavior.
> The number of znodes kept growing without stopping (normally we have 4,000 or 
> 5,000; we got to 60,000, and then we stopped our application).
> In the ZooKeeper logs I saw this (on the leader only, once every minute):
> 2016-07-04 14:53:50,302 [myid:7] - ERROR 
> [ContainerManagerTask:ContainerManager$1@84] - Error checking containers
> java.lang.NullPointerException
>at 
> org.apache.zookeeper.server.ContainerManager.getCandidates(ContainerManager.java:151)
>at 
> org.apache.zookeeper.server.ContainerManager.checkContainers(ContainerManager.java:111)
>at 
> org.apache.zookeeper.server.ContainerManager$1.run(ContainerManager.java:78)
>at java.util.TimerThread.mainLoop(Timer.java:555)
>at java.util.TimerThread.run(Timer.java:505)
> We have not yet deleted the data ... so the problem can be reproduced on our 
> servers
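
The stack trace above points at ContainerManager.getCandidates(). One way an 
NPE like that can arise is a check-then-act race: a list of container paths is 
snapshotted, each node is then looked up again, and a node deleted in between 
comes back as null. The sketch below is a generic Java illustration of that 
race and of the defensive null check that avoids it; it is not the actual 
ZooKeeper ContainerManager code.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Generic illustration of a list-then-lookup race; NOT ZooKeeper's ContainerManager.
public class ContainerScanSketch {
    static final class Node {
        final int childCount;
        Node(int childCount) { this.childCount = childCount; }
    }

    private final Map<String, Node> nodes = new ConcurrentHashMap<>();

    // Collect the paths of containers that currently have no children.
    Set<String> getCandidates(Set<String> containerPathSnapshot) {
        Set<String> candidates = new HashSet<>();
        for (String path : containerPathSnapshot) {
            Node node = nodes.get(path);
            // Without the null check, a container deleted after the snapshot was
            // taken but before this lookup would cause a NullPointerException here.
            if (node != null && node.childCount == 0) {
                candidates.add(path);
            }
        }
        return candidates;
    }
}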



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (ZOOKEEPER-1869) zk server falling apart from quorum due to connection loss and couldn't connect back

2015-09-09 Thread Jens Rantil (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737084#comment-14737084
 ] 

Jens Rantil commented on ZOOKEEPER-1869:


Thank you for your reply, Flavio. FYI, we've now upgraded to 3.4. I'll post 
here if we can recreate the issue.

> zk server falling apart from quorum due to connection loss and couldn't 
> connect back
> 
>
> Key: ZOOKEEPER-1869
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1869
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.0
> Environment: Using CentOS6 for running these zookeeper servers
>Reporter: Deepak Jagtap
>Priority: Critical
>
> We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the 
> quorum.
> The problem we are facing is that one zookeeper server in the quorum falls 
> apart and never becomes part of the cluster again until we restart the 
> zookeeper server on that node.
> Our interpretation of the zookeeper logs on all nodes is as follows:
> (For simplicity assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3)
> Initially S3 is the leader while S1 and S2 are followers.
> S2 hits a 46-second latency while fsyncing the write-ahead log, which results 
> in loss of connection with S3.
> S3 in turn prints the following error message:
> Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> Stack trace
> *** GOODBYE /169.254.1.2:47647(S2) 
> S2 in this case closes the connection with S3 (the leader) and shuts down the 
> follower with the following log messages:
> Closing connection to leader, exception during packet send
> java.net.SocketException: Socket close
> Follower@194] - shutdown called
> java.lang.Exception: shutdown Follower
> After this point S3 could never reestablish a connection with S2, and the 
> leader election mechanism keeps failing. S3 now keeps printing the following 
> message repeatedly:
> Cannot open channel to 2 at election address /169.254.1.2:3888
> java.net.ConnectException: Connection refused.
> While S3 is in this state, S2 repeatedly keeps printing the following messages:
> INFO 
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /127.0.0.1:60667
> Exception causing close of session 0x0: ZooKeeperServer not running
> Closed socket connection for client /127.0.0.1:60667 (no session established 
> for client)
> Leader election never completes successfully, causing S2 to fall apart from 
> the quorum.
> S2 was out of the quorum for almost 1 week.
> While debugging this issue, we found out that neither the election port nor 
> the peer connection port on S2 could be reached via telnet from any of the 
> nodes (S1, S2, S3). Network connectivity is not the issue. Later, we restarted 
> the ZK server on S2 (service zookeeper-server restart) -- after that we could 
> telnet to both ports, and S2 joined the ensemble after a leader election 
> attempt.
> Any idea what might be forcing S2 into a situation where it won't accept any 
> connections on the leader election and peer connection ports?
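
Regarding the telnet checks described above, an equivalent programmatic probe 
of the peer and election ports (2888 and 3888 here, which are only the 
conventional defaults and may differ per deployment) could look like the 
following; it tests plain TCP reachability and nothing ZooKeeper-specific:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    // Attempt a TCP connect with a timeout; true means the port accepted the connection.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = "169.254.1.2";  // S2 in the report above
        int[] ports = {2888, 3888};   // peer and election ports (defaults assumed)
        for (int port : ports) {
            System.out.printf("%s:%d reachable=%b%n", host, port, canConnect(host, port, 2000));
        }
    }
}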



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ZOOKEEPER-1869) zk server falling apart from quorum due to connection loss and couldn't connect back

2015-07-31 Thread Jens Rantil (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648894#comment-14648894
 ] 

Jens Rantil commented on ZOOKEEPER-1869:


Hi. We, too, probably hit this issue yesterday. Suddenly, our entire ZooKeeper 
ensemble was unresponsive, and the logs tell us a very similar story to the one 
described by Deepak. Restarting all servers did not bring the ensemble back. We 
had to stop all nodes, clean the data directory (we only use znodes for locks) 
and start them all again. Since our logs contain some keys that could be a 
little sensitive, I'm willing to e-mail logs on request.

We are running version 3.3.5.

While I don't have any logs, I suspect this happened to us once a couple of 
weeks back, too.
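
For context on the znodes-for-locks usage mentioned above: the Curator shared 
reentrant lock recipe (InterProcessMutex) is typically used roughly as follows; 
the connection string and lock path are placeholders, not our actual setup:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;
import java.util.concurrent.TimeUnit;

public class LockExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // Each acquire creates an ephemeral sequential znode under the lock path.
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
        if (lock.acquire(10, TimeUnit.SECONDS)) {
            try {
                // ... critical section ...
            } finally {
                lock.release();
            }
        }
        client.close();
    }
}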

> zk server falling apart from quorum due to connection loss and couldn't 
> connect back
> 
>
> Key: ZOOKEEPER-1869
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1869
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.5.0
> Environment: Using CentOS6 for running these zookeeper servers
>Reporter: Deepak Jagtap
>Priority: Critical
>
> We have deployed zookeeper version 3.5.0.1515976, with 3 zk servers in the 
> quorum.
> The problem we are facing is that one zookeeper server in the quorum falls 
> apart and never becomes part of the cluster again until we restart the 
> zookeeper server on that node.
> Our interpretation of the zookeeper logs on all nodes is as follows:
> (For simplicity assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3)
> Initially S3 is the leader while S1 and S2 are followers.
> S2 hits a 46-second latency while fsyncing the write-ahead log, which results 
> in loss of connection with S3.
> S3 in turn prints the following error message:
> Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> Stack trace
> *** GOODBYE /169.254.1.2:47647(S2) 
> S2 in this case closes the connection with S3 (the leader) and shuts down the 
> follower with the following log messages:
> Closing connection to leader, exception during packet send
> java.net.SocketException: Socket close
> Follower@194] - shutdown called
> java.lang.Exception: shutdown Follower
> After this point S3 could never reestablish a connection with S2, and the 
> leader election mechanism keeps failing. S3 now keeps printing the following 
> message repeatedly:
> Cannot open channel to 2 at election address /169.254.1.2:3888
> java.net.ConnectException: Connection refused.
> While S3 is in this state, S2 repeatedly keeps printing the following messages:
> INFO 
> [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
>  - Accepted socket connection from /127.0.0.1:60667
> Exception causing close of session 0x0: ZooKeeperServer not running
> Closed socket connection for client /127.0.0.1:60667 (no session established 
> for client)
> Leader election never completes successfully, causing S2 to fall apart from 
> the quorum.
> S2 was out of the quorum for almost 1 week.
> While debugging this issue, we found out that neither the election port nor 
> the peer connection port on S2 could be reached via telnet from any of the 
> nodes (S1, S2, S3). Network connectivity is not the issue. Later, we restarted 
> the ZK server on S2 (service zookeeper-server restart) -- after that we could 
> telnet to both ports, and S2 joined the ensemble after a leader election 
> attempt.
> Any idea what might be forcing S2 into a situation where it won't accept any 
> connections on the leader election and peer connection ports?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)