[ https://issues.apache.org/jira/browse/FLINK-28947 ]
DavidLiu deleted comment on FLINK-28947:
----------------------------------
was (Author: JIRAUSER289843):
It seems Curator has a bug. I can help fix it.
> Curator framework fails with NullPointerException
> -------------------------------------------------
>
> Key: FLINK-28947
> URL: https://issues.apache.org/jira/browse/FLINK-28947
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.15.1
> Reporter: Juha
> Priority: Major
>
> I'm getting the following error in the JobManager, and as a result the
> JobManager exits.
> {code:java}
> Aug 12 06:37:30 server_name java[173]: [2022-08-12 06:37:30,491] ERROR
> Background exception was not retry-able or retry gave up
> (org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl:733)
> Aug 12 06:37:30 server_name java[173]: java.lang.NullPointerException: null
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.utils.Compatibility.getHostAddress(Compatibility.java:116)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:926)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:683)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: [2022-08-12 06:37:30,493] ERROR
> Unhandled error in curator framework, error message: Background exception was
> not retry-able or retry gave up
> (org.apache.flink.runtime.util.ZooKeeperUtils:292)
> Aug 12 06:37:30 server_name java[173]: java.lang.NullPointerException: null
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.utils.Compatibility.getHostAddress(Compatibility.java:116)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:926)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:683)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: [2022-08-12 06:37:30,494] ERROR Fatal
> error occurred while executing the TaskManager. Shutting it down...
> (org.apache.flink.runtime.taskexecutor.TaskManagerRunner:427)
> Aug 12 06:37:30 server_name java[173]: java.lang.NullPointerException: null
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.utils.Compatibility.getHostAddress(Compatibility.java:116)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString(EnsembleTracker.java:185)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.processConfigData(EnsembleTracker.java:206)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.access$300(EnsembleTracker.java:50)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker$2.processResult(EnsembleTracker.java:150)
> ~[flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:926)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:683)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult(GetConfigBuilderImpl.java:222)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:598)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> Aug 12 06:37:30 server_name java[173]: at
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> [flink-shaded-zookeeper-3.5.9.jar:3.5.9-15.0]
> {code}
> Steps
> * Create three servers
> * Run Flink JobManager and TaskManager on all of them (let's call these A, B
> and C). Use ZooKeeper HA Services.
> * Everything works as expected
> * Add a new server (D).
> * Shut down server C.
> * This error can be seen on both servers A and D. I didn't check B and C.
> This can apparently be reproduced on every run.
> I'm using Flink 1.15.1. Actually I'm migrating from 1.13.X to 1.15.X. I'm not
> totally sure whether this ever happens on 1.13.X, but it seems to _always_
> happen on 1.15.1.
> I used a debugger to see what's going on in the JobManager:
> {code:java}
> main-EventThread[1] where
> [1]
> org.apache.flink.shaded.curator5.org.apache.curator.utils.Compatibility.getHostAddress
> (Compatibility.java:116)
> [2]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.configToConnectionString
> (EnsembleTracker.java:185)
> [3]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.processConfigData
> (EnsembleTracker.java:206)
> [4]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker.access$300
> (EnsembleTracker.java:50)
> [5]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.EnsembleTracker$2.processResult
> (EnsembleTracker.java:150)
> [6]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback
> (CuratorFrameworkImpl.java:926)
> [7]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation
> (CuratorFrameworkImpl.java:683)
> [8]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation
> (WatcherRemovalFacade.java:152)
> [9]
> org.apache.flink.shaded.curator5.org.apache.curator.framework.imps.GetConfigBuilderImpl$2.processResult
> (GetConfigBuilderImpl.java:222)
> [10]
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent
> (ClientCnxn.java:598)
> [11]
> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run
> (ClientCnxn.java:510)
> main-EventThread[1] dump address
> address = {
> holder: instance of
> java.net.InetSocketAddress$InetSocketAddressHolder(id=8302)
> serialVersionUID: 5076001401234631237
> serialPersistentFields: instance of java.io.ObjectStreamField[3] (id=8303)
> UNSAFE: instance of jdk.internal.misc.Unsafe(id=8304)
> FIELDS_OFFSET: 12
> java.net.SocketAddress.serialVersionUID: 5215720748342549866
> }
> main-EventThread[1] dump address.holder
> address.holder = {
> hostname: "host_name_here"
> addr: null
> port: 2888
> }
> main-EventThread[1] print address.getAddress()
> address.getAddress() = null
> {code}
> (The hostname has been changed).
> Line 116 of Compatibility.java
> (https://github.com/apache/curator/blob/d65669b64f003326c98843b32b997e3ffab1e442/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java#L116)
> contains this:
> {code}
> return (address != null) ? address.getAddress().getHostAddress() : "unknown";
> {code}
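> Here {{address}} itself is not {{null}}, but {{address.getAddress()}} returns
> {{null}} (see {{addr: null}} in the dump above), so the chained call to
> {{getHostAddress()}} throws the NullPointerException and causes the crash.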
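
The null dereference described above can be reproduced outside Flink with an unresolved java.net.InetSocketAddress. The following is a minimal sketch, not the actual Curator code or fix: the class name UnresolvedAddressDemo and the methods unsafeHostAddress and safeHostAddress are hypothetical names introduced for illustration. unsafeHostAddress mirrors the quoted line 116; safeHostAddress is one possible null-safe variant that falls back to the hostname string, assuming that is an acceptable substitute when no IP address is available.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

class UnresolvedAddressDemo {

    // Mirrors the expression quoted from Compatibility.java line 116:
    // only the socket address is null-checked, not the wrapped InetAddress.
    static String unsafeHostAddress(InetSocketAddress address) {
        return (address != null) ? address.getAddress().getHostAddress() : "unknown";
    }

    // Hypothetical null-safe variant (an assumption, not Curator's fix):
    // fall back to the hostname string when the address is unresolved.
    static String safeHostAddress(InetSocketAddress address) {
        if (address == null) {
            return "unknown";
        }
        InetAddress resolved = address.getAddress();
        // getHostString() returns the stored hostname without a DNS lookup.
        return (resolved != null) ? resolved.getHostAddress() : address.getHostString();
    }

    public static void main(String[] args) {
        // Same state as in the debugger dump: hostname set, addr == null.
        InetSocketAddress unresolved =
                InetSocketAddress.createUnresolved("host_name_here", 2888);

        System.out.println(unresolved.getAddress()); // prints "null"

        try {
            unsafeHostAddress(unresolved);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException reproduced");
        }

        System.out.println(safeHostAddress(unresolved)); // prints "host_name_here"
    }
}
```

Whether falling back to the hostname is the right behavior for EnsembleTracker's connection string is a design question for the Curator maintainers; the sketch only shows that the second null check avoids the crash.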
--
This message was sent by Atlassian Jira
(v8.20.10#820010)