Vladimir Prus created ZEPPELIN-5334:
---------------------------------------
Summary: DNS race condition connecting to K8S interpreter pod
Key: ZEPPELIN-5334
URL: https://issues.apache.org/jira/browse/ZEPPELIN-5334
Project: Zeppelin
Issue Type: Bug
Components: interpreter-launcher
Affects Versions: 0.9.0
Reporter: Vladimir Prus
Apologies in advance for reporting a bug that is hard to reproduce;
I cannot reproduce it at will myself.
From time to time, running a paragraph from a fresh start fails with an error
such as
{code:java}
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: livy-wjmmsl.spark.svc
    at org.apache.zeppelin.interpreter.remote.PooledRemoteClient.callRemoteFunction(PooledRemoteClient.java:115)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:99)
{code}
Examining the Zeppelin server logs reveals this sequence of events:
{code:java}
07:00:03.662, zeppelin-server: Interpreter pod created livy-wjmmsl.spark.svc:12321
07:00:03.709, dnsmasq: Received DNS query for "livy-wjmmsl.spark.svc."
07:00:03.709, dnsmasq: Querying nameserver 172.20.0.10:53 question livy-wjmmsl.spark.svc.
07:00:03.725, zeppelin-server: java.net.UnknownHostException: livy-wjmmsl.spark.svc
{code}
It seems that Zeppelin assumes that as soon as the pod is running, it can be
looked up by its DNS name. However, CoreDNS needs to learn about the new pod
from the API server and update its records, which takes a non-zero amount of
time; in the log above, the lookup fails only 63 ms after the pod is created.
I propose that Zeppelin retry with exponential backoff when resolving the DNS
name of a new interpreter.
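
To make the proposal concrete, here is a minimal sketch of a retrying lookup. It is not taken from the Zeppelin code base: the class DnsBackoff, the Lookup hook, the attempt count, and the delays are all hypothetical names and values chosen for illustration.

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Zeppelin: retries a DNS lookup with
// exponentially growing pauses between attempts.
final class DnsBackoff {

    // Upper bound for a single pause (assumed value).
    private static final long MAX_DELAY_MS = 5_000L;

    // Lookup hook so the retry loop can be exercised without real DNS.
    @FunctionalInterface
    interface Lookup {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    static InetAddress resolveWithBackoff(String host, int maxAttempts,
                                          long initialDelayMs, Lookup lookup)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait, then double the pause for the next attempt.
                    Thread.sleep(delayMs);
                    delayMs = Math.min(delayMs * 2, MAX_DELAY_MS);
                }
            }
        }
        // Every attempt failed: surface the most recent failure.
        throw last;
    }

    // Convenience overload using the JVM resolver; 6 attempts starting at
    // 100 ms are assumed defaults, not measured values.
    static InetAddress resolveWithBackoff(String host)
            throws UnknownHostException, InterruptedException {
        return resolveWithBackoff(host, 6, 100L, InetAddress::getByName);
    }
}
{code}
One caveat with this approach: the JVM caches failed lookups (security property networkaddress.cache.negative.ttl, 10 seconds by default), so retries through InetAddress within that window may return the cached failure unless the property is lowered.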