Andreas Weise created ZEPPELIN-3435:
---------------------------------------
Summary: Interpreter timeout lifecycle leads to interpreter
process orphans
Key: ZEPPELIN-3435
URL: https://issues.apache.org/jira/browse/ZEPPELIN-3435
Project: Zeppelin
Issue Type: Bug
Components: zeppelin-zengine
Affects Versions: 0.8.0
Reporter: Andreas Weise
We have configured our interpreters to time out after 60 minutes. From time to
time an interpreter is not closed properly when the timeout fires: the remote
interpreter process stays alive. This behavior is non-deterministic.
When the timeout is reached in the failing case, only the following is logged:
{noformat}
INFO [2018-04-27 13:06:44,329] ({Timer-0} TimeoutLifecycleManager.java[run]:49)
- InterpreterGroup spark:shared_process is timeout.
INFO [2018-04-27 13:06:44,329] ({Timer-0}
ManagedInterpreterGroup.java[close]:89) - Close InterpreterGroup:
spark:shared_process
INFO [2018-04-27 13:06:44,329] ({Timer-0}
ManagedInterpreterGroup.java[close]:100) - Close Session: 2D8VRV5M6 for
interpreter setting: spark
WARN [2018-04-27 13:06:44,329] ({Timer-0} RemoteInterpreter.java[close]:199) -
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.SparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) -
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.SparkSqlInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) -
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.DepInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) -
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.PySparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) -
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.IPySparkInterpreter
WARN [2018-04-27 13:06:44,330] ({Timer-0} RemoteInterpreter.java[close]:199) -
close is called when RemoterInterpreter is not opened for org.apache.zeppelin.
spark.SparkRInterpreter
INFO [2018-04-27 13:06:44,330] ({Timer-0}
ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup:
spark:shared_process as all the
sessions are closed
{noformat}
For a *successful* shutdown we additionally see the following log entries, which
are missing when this bug occurs:
{noformat}
INFO [2018-04-27 13:11:20,485] ({Timer-0}
ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup:
spark_FKT_Reports:shared_process as all the sessions are closed
INFO [2018-04-27 13:11:20,485] ({Timer-0}
ManagedInterpreterGroup.java[close]:108) - Kill RemoteInterpreterProcess
INFO [2018-04-27 13:11:20,485] ({Timer-0}
RemoteInterpreterManagedProcess.java[stop]:220) - Kill interpreter process
ERROR [2018-04-27 13:11:20,692] ({Thread-71907}
RemoteInterpreterEventPoller.java[run]:257) - Can not get
RemoteInterpreterEvent because it is shutdown.
ERROR [2018-04-27 13:11:20,692] ({pool-30-thread-1}
AppendOutputRunner.java[run]:68) - Wait for OutputBuffer queue interrupted: null
WARN [2018-04-27 13:11:22,991] ({Timer-0}
RemoteInterpreterManagedProcess.java[stop]:230) - ignore the exception when
shutting down
INFO [2018-04-27 13:11:22,993] ({Timer-0}
RemoteInterpreterManagedProcess.java[stop]:238) - Remote process terminated
{noformat}
So when the bug occurs, line 108 of ManagedInterpreterGroup (the "Kill
RemoteInterpreterProcess" log statement) is never reached.
When a notebook is triggered after the timeout has occurred, an additional
interpreter process is started and the first one stays alive indefinitely.
Restarting the interpreter does not kill the first process either.
Only after Zeppelin itself is restarted are all orphaned interpreter processes killed.
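To check a host for leftover interpreter processes, something like the following sketch could be used. It is not part of Zeppelin. It assumes that interpreter JVMs can be recognised by the string "RemoteInterpreterServer" in their command line, and the class and method names are hypothetical. It needs Java 9 or later for ProcessHandle, and only sees command lines that the OS exposes to the current user.

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrphanInterpreterFinder {
    // Assumption: interpreter JVMs carry this main-class name in their command line.
    static final String MARKER = "RemoteInterpreterServer";

    // Pure helper: decides whether a command line looks like an interpreter process.
    static boolean looksLikeInterpreter(String commandLine) {
        return commandLine != null && commandLine.contains(MARKER);
    }

    // Scans all processes visible to this JVM and returns the PIDs whose
    // command line matches the marker. Processes without an accessible
    // command line are skipped.
    static List<Long> findInterpreterPids() {
        return ProcessHandle.allProcesses()
            .filter(p -> p.info().commandLine()
                .map(OrphanInterpreterFinder::looksLikeInterpreter)
                .orElse(false))
            .map(ProcessHandle::pid)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints one line per candidate; deciding which ones are orphans
        // (e.g. more than one per interpreter group) is left to the operator.
        findInterpreterPids()
            .forEach(pid -> System.out.println("interpreter process pid=" + pid));
    }
}
```

Comparing the printed PIDs before and after a timeout should show whether the old process survived alongside a newly started one.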