Vladimir Prus created ZEPPELIN-5334:
---------------------------------------

             Summary: DNS race condition connecting to K8S interpreter pod
                 Key: ZEPPELIN-5334
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-5334
             Project: Zeppelin
          Issue Type: Bug
          Components: interpreter-launcher
    Affects Versions: 0.9.0
            Reporter: Vladimir Prus


Apologies in advance for a bug report that is hard to reproduce; I cannot
reproduce it at will myself.

From time to time, running a paragraph from a fresh start fails with an error
such as:

{code:java}
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException: java.net.UnknownHostException: livy-wjmmsl.spark.svc
 at org.apache.zeppelin.interpreter.remote.PooledRemoteClient.callRemoteFunction(PooledRemoteClient.java:115)
 at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:99)
{code}
 

Examining the Zeppelin server logs reveals this sequence of events:

{code:java}
07:00:03.662, zeppelin-server: Interpreter pod created livy-wjmmsl.spark.svc:12321
07:00:03.709, dnsmasq: Received DNS query for "livy-wjmmsl.spark.svc."
07:00:03.709, dnsmasq: Querying nameserver 172.20.0.10:53 question livy-wjmmsl.spark.svc.
07:00:03.725, zeppelin-server: java.net.UnknownHostException: livy-wjmmsl.spark.svc
{code}
It seems that Zeppelin assumes that as soon as the pod is running, it can be
looked up by its DNS name. However, CoreDNS first needs to learn about the new
pod from the API server and update its records, which takes a non-zero amount
of time (in the log above, the lookup failed about 60 ms after the pod was
reported as created). I would propose that Zeppelin retry with exponential
backoff when resolving the DNS name of a new interpreter.
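
To illustrate the proposal, here is a minimal sketch of a DNS lookup retried with exponential backoff. It is not taken from the Zeppelin code base: the class name DnsBackoffResolver, the Lookup interface, and all parameters are hypothetical and chosen only for illustration. The lookup is injected so that the retry loop can be exercised without real DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Zeppelin: retries a DNS lookup with
// exponentially growing pauses between attempts.
final class DnsBackoffResolver {

    /** Hypothetical lookup abstraction; production code would pass InetAddress::getByName. */
    @FunctionalInterface
    public interface Lookup {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /**
     * Tries to resolve the host up to maxAttempts times. After each failure it
     * sleeps, doubling the pause each time and capping it at maxDelayMs.
     * The last UnknownHostException is rethrown if every attempt fails.
     */
    public static InetAddress resolveWithBackoff(Lookup lookup, String host,
            int maxAttempts, long initialDelayMs, long maxDelayMs)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.resolve(host);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // no point sleeping after the final attempt
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw last;
    }

    /** Convenience overload using the JVM resolver; the limits are arbitrary example values. */
    public static InetAddress resolveWithBackoff(String host)
            throws UnknownHostException, InterruptedException {
        return resolveWithBackoff(InetAddress::getByName, host, 6, 100, 2000);
    }
}
```

One caveat for any real implementation: the JVM caches failed lookups (security property networkaddress.cache.negative.ttl, 10 seconds by default), so retries through InetAddress that happen within that window may keep returning the cached failure unless the setting is lowered.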

 



