Charles Hedrick created ZEPPELIN-3816:
-----------------------------------------

             Summary: after moderate usage, can no longer use Spark2
                 Key: ZEPPELIN-3816
                 URL: https://issues.apache.org/jira/browse/ZEPPELIN-3816
             Project: Zeppelin
          Issue Type: Bug
          Components: zeppelin-interpreter
         Environment: h3. spark2
%spark2, %spark2.sql, %spark2.dep, %spark2.pyspark, %spark2.r
spark ui edit restart remove

h5. Option
The interpreter will be instantiated Per User in isolated process.
User Impersonate
Connect to existing process
Set permission

h5. Properties
||name||value||
|SPARK_HOME|/usr/hdp/current/spark2-client/|
|args| |
|master|local[*]|
|spark.app.name|Zeppelin|
|spark.cores.max| |
|spark.executor.memory| |
|zeppelin.R.cmd|R|
|zeppelin.R.image.width|100%|
|zeppelin.R.knitr|true|
|zeppelin.R.render.options|out.format = 'html', comment = NA, echo = FALSE, results = 'asis', message = F, warning = F|
|zeppelin.dep.additionalRemoteRepository|spark-packages,http://dl.bintray.com/spark-packages/maven,false;|
|zeppelin.dep.localrepo|local-repo|
|zeppelin.interpreter.localRepo|/usr/hdp/current/zeppelin-server/local-repo/2DRMGSB7A|
|zeppelin.interpreter.output.limit|102400|
|zeppelin.pyspark.python|/usr/local/bin/zsparkpy|
|zeppelin.spark.concurrentSQL|false|
|zeppelin.spark.importImplicit|true|
|zeppelin.spark.maxResult|1000|
|zeppelin.spark.printREPLOutput|true|
|zeppelin.spark.sql.stacktrace|false|
|zeppelin.spark.useHiveContext|true|
            Reporter: Charles Hedrick
             Fix For: 0.7.3


This is Zeppelin installed as part of HDP 2.6.3.0-235.

We have a Zeppelin system being used by a large class. Everything except MD is configured to run with user impersonation, isolated. Users primarily use spark2.

After a while the system becomes unusable. I've been restarting once a day, but today even that wasn't enough. Restarting my interpreter doesn't help, and indeed I believe this happens to all users. Livy2 still works.

Our system is kerberized. Users automatically get Kerberos credentials when they log in (via PAM).

Once the problem occurs we get this kind of error:

ERROR [2018-10-17 16:04:55,608] ({Thread-2817} RemoteInterpreterEventPoller.java[run]:113) - Can't get RemoteInterpreterEvent
org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getEvent(RemoteInterpreterService.java:429)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getEvent(RemoteInterpreterService.java:417)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterEventPoller.run(RemoteInterpreterEventPoller.java:110)
ERROR [2018-10-17 16:04:55,620] ({Thread-2819} JobProgressPoller.java[run]:54) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:500)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:121)
    at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:333)
    at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:51)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpreterService.java:313)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterService.java:298)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:497)
    ... 3 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:209)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
    ... 11 more
ERROR [2018-10-17 16:04:55,618] ({pool-2-thread-37} Job.java[run]:188) - Job failed
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:426)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:101)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:410)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:266)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:250)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:398)
    ... 11 more
ERROR [2018-10-17 16:04:55,625] ({pool-2-thread-37} RemoteScheduler.java[getStatus]:256) - Can't get status information
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused (Connection refused)

I think at this point it's repeating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)