Ruslan Dautkhanov created ZEPPELIN-1984:
-------------------------------------------
Summary: Zeppelin Server doesn't catch all exception when
launching a new interpreter process
Key: ZEPPELIN-1984
URL: https://issues.apache.org/jira/browse/ZEPPELIN-1984
Project: Zeppelin
Issue Type: Bug
Components: zeppelin-interpreter, zeppelin-server
Affects Versions: 0.7.0
Environment: Zeppelin server from a month old master snapshot
Reporter: Ruslan Dautkhanov
We saw the "Connection refused" exception stack shown further below when the
Zeppelin server tried to start a new interpreter process, for example the Spark
interpreter. It was really hard to debug, and the only way to capture the real
root cause was to add
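{code}
# Per-process log file; $$ expands to the PID of this interpreter.sh run
LOG="/tmp/interpreter.sh-$$.log"
date >> "$LOG"
# Trace every command the script executes
set -x
# Append stdout to the log, then send stderr (including the trace) there too
exec >> "$LOG"
exec 2>&1
{code}
to the $zeppelinhome/bin/interpreter.sh file, so that all stdout and stderr
from interpreter.sh went to that log file. The log then showed the real problem: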
{noformat}
Exception in thread "main" org.apache.spark.SparkException: Keytab file: /home/<username>/.kt does not exist
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
...
{noformat}
while all other Zeppelin logs and the note output showed only a misleading
"Connection refused"; see the stack below:
{noformat}
ERROR [2017-01-18 16:54:38,533] ({pool-2-thread-2} NotebookServer.java[afterStatusChange]:1645) - Error
org.apache.zeppelin.interpreter.InterpreterException: org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.init(RemoteInterpreter.java:232)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:400)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:105)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:316)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
...
{noformat}
The issue might be that interpreter.sh is started here and exits right away:
https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L121
and this is not captured anywhere. The only sign you'll see on the Zeppelin side
is "Connection refused", because Zeppelin cannot connect to the new
interpreter process. We saw several different root causes (the spark-submit
error above about a missing keytab file is just one of them), and every
time we had to add tracing to interpreter.sh to capture the real problem.
We think there are two possible ways to improve this:
1) capture the fact that interpreter.sh bailed out (and don't try to connect in
https://github.com/apache/zeppelin/blob/master/zeppelin-interpreter/src/main/java/org/apache/zeppelin/interpreter/remote/RemoteInterpreterManagedProcess.java#L132
as that will only produce the expected "Connection refused")
2) if point 1) isn't possible for some reason (although I don't see why that
would be), at least capture the errors produced by interpreter.sh, so that the
error stack in the Zeppelin log files and in the output of the paragraph that
kicked off the interpreter start has some meaningful information.
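To illustrate both points, here is a minimal sketch in plain shell; it is not Zeppelin code. It starts a launcher command in the background, waits a short grace period, and checks whether the launcher has already exited before any connection attempt. If it has, the sketch reports the exit status together with the captured stdout/stderr. The function name launch_and_check, the ".status" side file and the grace period are all hypothetical, chosen only for this example. In Zeppelin itself the equivalent check would presumably live on the Java side, near the lines linked above.

```shell
# Hypothetical helper (not part of Zeppelin).
# Usage: launch_and_check <logfile> <grace-seconds> <command> [args...]
launch_and_check() {
  errlog="$1"
  grace="$2"
  shift 2
  statusfile="$errlog.status"
  rm -f "$statusfile"

  # Run the launcher in a background subshell. All of its stdout and stderr
  # go to the log file, and its exit status is written to a side file as
  # soon as it terminates.
  ( "$@"; echo "$?" > "$statusfile" ) >> "$errlog" 2>&1 &

  # Give the launcher a moment to either come up or fail.
  sleep "$grace"

  if [ -s "$statusfile" ]; then
    # The launcher is already gone: surface its status and output instead of
    # attempting a connection that can only end in "Connection refused".
    echo "launcher exited early with status $(cat "$statusfile"); captured output:" >&2
    cat "$errlog" >&2
    return 1
  fi

  echo "launcher still running; safe to try connecting"
  return 0
}
```

A call such as `launch_and_check /tmp/interpreter-launch.log 2 bin/interpreter.sh ...` (arguments elided) would then fail fast with the launcher's own error text, for example the missing-keytab message above, rather than with a later connection error. A fixed grace period is a simplification; a real implementation would more likely watch the child process for exit while polling the port.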