Charles Hedrick created ZEPPELIN-3816:
-----------------------------------------
Summary: after moderate usage, can no longer use Spark2
Key: ZEPPELIN-3816
URL: https://issues.apache.org/jira/browse/ZEPPELIN-3816
Project: Zeppelin
Issue Type: Bug
Components: zeppelin-interpreter
Environment: h3. spark2 %spark2, %spark2.sql, %spark2.dep,
%spark2.pyspark, %spark2.r
h5. Option
The interpreter will be instantiated Per User in isolated process.
User Impersonate
Connect to existing process
Set permission
h5. Properties
||name||value||
|SPARK_HOME|/usr/hdp/current/spark2-client/|
|args| |
|master|local[*]|
|spark.app.name|Zeppelin|
|spark.cores.max| |
|spark.executor.memory| |
|zeppelin.R.cmd|R|
|zeppelin.R.image.width|100%|
|zeppelin.R.knitr|true|
|zeppelin.R.render.options|out.format = 'html', comment = NA, echo = FALSE, results = 'asis', message = F, warning = F|
|zeppelin.dep.additionalRemoteRepository|spark-packages,http://dl.bintray.com/spark-packages/maven,false;|
|zeppelin.dep.localrepo|local-repo|
|zeppelin.interpreter.localRepo|/usr/hdp/current/zeppelin-server/local-repo/2DRMGSB7A|
|zeppelin.interpreter.output.limit|102400|
|zeppelin.pyspark.python|/usr/local/bin/zsparkpy|
|zeppelin.spark.concurrentSQL|false|
|zeppelin.spark.importImplicit|true|
|zeppelin.spark.maxResult|1000|
|zeppelin.spark.printREPLOutput|true|
|zeppelin.spark.sql.stacktrace|false|
|zeppelin.spark.useHiveContext|true|
Reporter: Charles Hedrick
Fix For: 0.7.3
This is Zeppelin as installed with HDP 2.6.3.0-235.
We have a Zeppelin system being used by a large class. Every interpreter except MD is
configured to run isolated, with user impersonation. Users primarily use spark2.
After a while the system becomes unusable. I've been restarting Zeppelin once a day, but
today even that wasn't enough. Once the problem occurs we get errors like the
ones in the log excerpt below.
Restarting my own interpreter doesn't help, and I believe the problem affects
all users.
Livy2 still works.
Our system is kerberized. Users automatically get Kerberos credentials when
they log in (via PAM).
ERROR [2018-10-17 16:04:55,608] ({Thread-2817} RemoteInterpreterEventPoller.java[run]:113) - Can't get RemoteInterpreterEvent
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getEvent(RemoteInterpreterService.java:429)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getEvent(RemoteInterpreterService.java:417)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:110)
ERROR [2018-10-17 16:04:55,620] ({Thread-2819} JobProgressPoller.java[run]:54) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:500)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:121)
at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:333)
at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:51)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpreterService.java:313)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterService.java:298)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:497)
... 3 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
... 11 more
ERROR [2018-10-17 16:04:55,618] ({pool-2-thread-37} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:426)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:101)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:410)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:398)
... 11 more
ERROR [2018-10-17 16:04:55,625] ({pool-2-thread-37} RemoteScheduler.java[getStatus]:256) - Can't get status information
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)
I think the log just repeats from this point on.
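Since the log seems to repeat, a tally of which component logs each ERROR may help show whether the failures all trace back to the same lost interpreter connection. The following is only a minimal sketch: it assumes each entry's header sits on a single line in the actual Zeppelin log file, in the format seen in the excerpts above, and the names `HEADER` and `summarize_errors` are hypothetical, not part of Zeppelin.

```python
import re
from collections import Counter

# Header layout assumed from the excerpts in this report, e.g.
#   ERROR [2018-10-17 16:04:55,608] ({Thread-2817} Foo.java[run]:113) - message
# The optional backslash tolerates the "\{" escaping that Jira adds.
HEADER = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<ts>[^\]]+)\]\s+"
    r"\(\\?\{(?P<thread>[^}]+)\}\s+"
    r"(?P<source>[\w.$]+)\[(?P<method>[^\]]+)\]:(?P<line>\d+)\)"
    r"\s+-\s+(?P<msg>.*)$"
)


def summarize_errors(log_text):
    """Count ERROR entries per (source file, method).

    Hypothetical helper: lines that do not look like a log header
    (stack frames, "Caused by:" lines) are simply skipped.
    """
    counts = Counter()
    for raw in log_text.splitlines():
        match = HEADER.match(raw.strip())
        if match and match.group("level") == "ERROR":
            counts[(match.group("source"), match.group("method"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_errors.py /path/to/zeppelin.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for (source, method), n in summarize_errors(handle.read()).most_common():
            print(f"{n:6d}  {source}[{method}]")
```

If nearly all entries come from the pollers and `RemoteScheduler`, that would be consistent with the interpreter process having gone away rather than with a problem in individual notebooks, though the tally alone cannot confirm the cause.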
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)