Charles Hedrick created ZEPPELIN-3816:
-----------------------------------------

             Summary: after moderate usage, can no longer use Spark2
                 Key: ZEPPELIN-3816
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3816
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter
         Environment: h3. spark2 %spark2, %spark2.sql, %spark2.dep, %spark2.pyspark, %spark2.r
h5. Option
The interpreter will be instantiated Per User in isolated process.
 
User Impersonate
Connect to existing process
Set permission
 
h5. Properties
||name||value||
|SPARK_HOME|/usr/hdp/current/spark2-client/|
|args| |
|master|local[*]|
|spark.app.name|Zeppelin|
|spark.cores.max| |
|spark.executor.memory| |
|zeppelin.R.cmd|R|
|zeppelin.R.image.width|100%|
|zeppelin.R.knitr|true|
|zeppelin.R.render.options|out.format = 'html', comment = NA, echo = FALSE, results = 'asis', message = F, warning = F|
|zeppelin.dep.additionalRemoteRepository|spark-packages,http://dl.bintray.com/spark-packages/maven,false;|
|zeppelin.dep.localrepo|local-repo|
|zeppelin.interpreter.localRepo|/usr/hdp/current/zeppelin-server/local-repo/2DRMGSB7A|
|zeppelin.interpreter.output.limit|102400|
|zeppelin.pyspark.python|/usr/local/bin/zsparkpy|
|zeppelin.spark.concurrentSQL|false|
|zeppelin.spark.importImplicit|true|
|zeppelin.spark.maxResult|1000|
|zeppelin.spark.printREPLOutput|true|
|zeppelin.spark.sql.stacktrace|false|
|zeppelin.spark.useHiveContext|true|
            Reporter: Charles Hedrick
             Fix For: 0.7.3


This is Zeppelin installed as part of HDP 2.6.3.0-235

We have a Zeppelin system being used by a large class. Every interpreter except
md is configured to run isolated per user, with user impersonation. Users
primarily use spark2.

After a while the system becomes unusable. I've been restarting it once a day,
but today even that wasn't enough. Once the problem occurs we get the kind of
errors shown in the log below.

Restarting my interpreter doesn't help, and I believe the problem affects
all users.

Livy2 still works.

Our system is kerberized. Users automatically get Kerberos credentials when
they log in (via PAM).

ERROR [2018-10-17 16:04:55,608] ({Thread-2817} RemoteInterpreterEventPoller.java[run]:113) - Can't get RemoteInterpreterEvent
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getEvent(RemoteInterpreterService.java:429)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getEvent(RemoteInterpreterService.java:417)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:110)

ERROR [2018-10-17 16:04:55,620] ({Thread-2819} JobProgressPoller.java[run]:54) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:500)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:121)
        at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:333)
        at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:51)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpreterService.java:313)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterService.java:298)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:497)
        ... 3 more
Caused by: java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:209)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        ... 11 more

ERROR [2018-10-17 16:04:55,618] ({pool-2-thread-37} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:426)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:101)
        at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:410)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
        at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
        at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:398)
        ... 11 more

ERROR [2018-10-17 16:04:55,625] ({pool-2-thread-37} RemoteScheduler.java[getStatus]:256) - Can't get status information
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)

I think the log just repeats from this point on.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
