[
https://issues.apache.org/jira/browse/HELIX-608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613447#comment-15613447
]
Lei Xia commented on HELIX-608:
-------------------------------
There is a bug in the zkclient library we are using. In ZkClient.java,
_connection and _connection.getZookeeper() are not supposed to be null until
the client is explicitly closed, and once it is closed a flag (_closed) is set.
This flag is checked in retryUntilConnected() before the callback is invoked.
For this reason, neither Helix's extended ZkClient nor the original ZkClient
checks for null in its various retryable operations.

  protected boolean exists(final String path, final boolean watch) {
    ......
    try {
      return retryUntilConnected(new Callable<Boolean>() {
        @Override
        public Boolean call() throws Exception {
          return _connection.exists(path, watch);
        }
      });
    .....
  }

  public <T> T retryUntilConnected(Callable<T> callable) throws
      ZkInterruptedException, IllegalArgumentException, ZkException, RuntimeException
  {
    .....
    while (true) {
      if (_closed) {
        throw new IllegalStateException("ZkClient already closed!");
      }
      try {
        return callable.call();
      } catch (ConnectionLossException e) {
        ...
        waitForRetry();
      } catch (SessionExpiredException e) {
        ....
        waitForRetry();
      } catch (KeeperException e) {
        throw ZkException.create(e);
      } catch (InterruptedException e) {
        throw new ZkInterruptedException(e);
      } catch (Exception e) {
        throw ExceptionUtil.convertToRuntimeException(e);
      }
      .....
    }
  }

However, there is a bug in reconnect(), which closes _connection and then
reconnects it. It does not set the _closed flag after closing the connection,
so if the reconnect fails, reconnect() exits with _connection holding no
ZooKeeper handle (null) and _closed not set. We then see an NPE if there are
still pending reads/writes to retry.

  private void reconnect() {
    getEventLock().lock();
    try {
      _connection.close();
      _connection.connect(this);
    } catch (InterruptedException e) {
      throw new ZkInterruptedException(e);
    } finally {
      getEventLock().unlock();
    }
  }

https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java
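
The failure mode described above can be reproduced in miniature. The sketch
below is only an illustration: FakeConnection and FakeClient are hypothetical
stand-ins, not the real ZkConnection/ZkClient classes, and the connect failure
is simulated with a constructor flag. It shows that when close() succeeds and
connect() throws, the closed flag stays false, so the next read passes the
closed check and dereferences a null handle.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; not the real zkclient classes.
final class ReconnectSketch {

    /** Stand-in for a connection whose handle is null while disconnected. */
    static final class FakeConnection {
        private Set<String> handle = new HashSet<>(); // starts out connected
        private final boolean failOnConnect;

        FakeConnection(boolean failOnConnect) {
            this.failOnConnect = failOnConnect;
        }

        void close() {
            handle = null;
        }

        void connect() {
            if (failOnConnect) {
                // Simulates a failed connect, e.g. a name resolution failure.
                throw new IllegalStateException("Unable to connect");
            }
            handle = new HashSet<>();
        }

        boolean exists(String path) {
            // No null check, mirroring the unguarded read path.
            return handle.contains(path);
        }
    }

    /** Stand-in for a client that only guards reads with a closed flag. */
    static final class FakeClient {
        private final FakeConnection connection;
        private boolean closed = false;

        FakeClient(FakeConnection connection) {
            this.connection = connection;
        }

        void reconnect() {
            connection.close();
            connection.connect(); // may throw; 'closed' is never set
        }

        boolean exists(String path) {
            if (closed) {
                throw new IllegalStateException("ZkClient already closed!");
            }
            return connection.exists(path);
        }
    }

    /** Returns the simple name of the exception a read hits after a failed reconnect. */
    static String outcomeAfterFailedReconnect() {
        FakeClient client = new FakeClient(new FakeConnection(true));
        try {
            client.reconnect();
        } catch (IllegalStateException expected) {
            // The reconnect failure itself is reported elsewhere; pending reads continue.
        }
        try {
            client.exists("/a");
            return "ok";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcomeAfterFailedReconnect()); // prints "NullPointerException"
    }
}
```
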
The right fix is in reconnect(); however, since it is a private method,
Helix cannot override it. This NPE happens only when the client fails to
reconnect to the ZooKeeper server, which should be rare given that ZooKeeper
is supposed to be highly available. Moreover, once it happens, even if Helix
checks for null, we can do nothing more than throw a different exception.
Instead, I will open a ticket with the zkclient open source community to
convince them to fix the problem.
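
To make the point about the null check concrete, here is a minimal sketch of
what such a guard could look like. Everything in it is an assumption for
illustration: guarded(), describe() and ConnectionUnavailableException are
hypothetical names, not part of Helix or zkclient. Note that the guard only
replaces the NPE with a more descriptive exception; it does not restore the
connection.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of Helix or zkclient.
final class NullGuardSketch {

    /** Hypothetical exception type, named here only for this sketch. */
    static final class ConnectionUnavailableException extends RuntimeException {
        ConnectionUnavailableException(String message) {
            super(message);
        }
    }

    /** Wraps an operation so a missing handle fails with a descriptive exception instead of an NPE. */
    static <T> Callable<T> guarded(Supplier<?> handle, Callable<T> operation) {
        return () -> {
            if (handle.get() == null) {
                throw new ConnectionUnavailableException(
                    "ZooKeeper handle is null; a previous reconnect may have failed");
            }
            return operation.call();
        };
    }

    /** Runs a trivial guarded operation and reports the result or the exception name. */
    static String describe(Supplier<?> handle) {
        try {
            return String.valueOf(guarded(handle, () -> Boolean.TRUE).call());
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(() -> null));  // handle missing
        System.out.println(describe(Object::new)); // handle present
    }
}
```

The caller still has to handle the failure either way, which is why a fix
inside reconnect() itself is the more useful change.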
> NPE and unable to reconnect to zookeeper after a network outage
> ---------------------------------------------------------------
>
> Key: HELIX-608
> URL: https://issues.apache.org/jira/browse/HELIX-608
> Project: Apache Helix
> Issue Type: Bug
> Components: helix-core
> Affects Versions: 0.7.1
> Reporter: Changgeng Li
> Assignee: Lei Xia
>
> I noticed that one of the participants is not a live instance in ZooKeeper
> after a network outage, while the Java process is still alive. I have to
> restart the Java process to make it live again.
> Found the following logs:
> ERROR 2015-07-28 17:12:15,010 [main-EventThread]
> org.apache.zookeeper.ClientCnxn: Error while calling watcher
> java.lang.RuntimeException: Exception while restarting zk client
> at
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:462)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.process(ZkClient.java:368)
> ~[zaaa.jar:?]
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> [zaaa.jar:?]
> at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> [zaaa.jar:?]
> Caused by: org.I0Itec.zkclient.exception.ZkException: Unable to connect to
> zzookeeperhost:2181,zookeeperhost2.com:2181/a
> at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:66)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935)
> ~[zaaa.jar:?]
> at
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
> ~[zaaa.jar:?]
> ... 3 more
> Caused by: java.net.UnknownHostException: zzookeeperhost: Temporary failure
> in name resolution
> at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> ~[?:1.7.0_72]
> at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
> ~[?:1.7.0_72]
> at
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
> ~[?:1.7.0_72]
> at java.net.InetAddress.getAllByName0(InetAddress.java:1246)
> ~[?:1.7.0_72]
> at java.net.InetAddress.getAllByName(InetAddress.java:1162)
> ~[?:1.7.0_72]
> at java.net.InetAddress.getAllByName(InetAddress.java:1098)
> ~[?:1.7.0_72]
> at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:387)
> ~[zaaa.jar:?]
> at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:332)
> ~[zaaa.jar:?]
> at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:383)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:64)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:935)
> ~[zaaa.jar:?]
> at
> org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:458)
> ~[zaaa.jar:?]
> ... 3 more
> INFO 2015-07-28 17:12:15,010 [main-EventThread]
> org.apache.zookeeper.ClientCnxn: EventThread shut down
> ERROR 2015-07-28 17:12:15,014
> [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a]
> org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of
> /zaaa/INSTANCES/10.211.12.21_9000/MESSAGES changed sent to
> org.apache.helix.manager.zk.ZkCallbackHandler@71bd5cfa]
> java.lang.NullPointerException
> at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195)
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192)
> ~[zaaa.jar:?]
> at
> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient.exists(ZkClient.java:192)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566) ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> [zaaa.jar:?]
> ERROR 2015-07-28 17:12:15,015
> [ZkClient-EventThread-184-zzookeeperhost:2181,zookeeperhost2.com:2181/a]
> org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of
> /zaaa/EXTERNALVIEW changed sent to
> org.apache.helix.manager.zk.ZkCallbackHandler@35d1655]
> java.lang.NullPointerException
> at org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:95)
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:195)
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient$2.call(ZkClient.java:192)
> ~[zaaa.jar:?]
> at
> org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
> ~[zaaa.jar:?]
> at org.apache.helix.manager.zk.ZkClient.exists(ZkClient.java:192)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:445)
> ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:566) ~[zaaa.jar:?]
> at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> [zaaa.jar:?]