GitHub user egorklimov opened a pull request: https://github.com/apache/zeppelin/pull/3165
[ZEPPELIN-3704] Scheduler.getJobsRunning() returns finished jobs

### What is this PR for?
Sometimes, when cron is configured with the "After execution stop the interpreter" setting enabled, the last paragraphs are marked as ABORT for no apparent reason. I found that the cause is that Scheduler.getJobsRunning() returns finished jobs. (I hit this problem in 0.8, but the same bug may also be present in 0.9.)

Short log (with additional log info from the TinkoffCreditSystems fork):
```
INFO [2018-08-10 00:08:00,000] ({DefaultQuartzScheduler_Worker-47} Notebook.java[execute]:945) - Start schedule run note: 2C68U586U, cronExpr:"0 8 0 * * ?"
INFO [2018-08-10 00:08:00,047] ({pool-2-thread-266} SchedulerFactory.java[jobStarted]:109) - Job 20170814-171621_1685490119 started by scheduler
INFO [2018-08-10 00:10:35,387] ({pool-2-thread-266} SchedulerFactory.java[jobFinished]:115) - Job 20170814-171621_1685490119 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreter-greenplum_pd:user:2C68U586U-shared_session
INFO [2018-08-10 00:10:35,417] ({pool-2-thread-3838} SchedulerFactory.java[jobStarted]:109) - Job 20180402-171122_400058927 started by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreter-spark:user:2C68U586U-shared_session
INFO [2018-08-10 00:11:57,428] ({pool-2-thread-3838} SchedulerFactory.java[jobFinished]:115) - Job 20180402-171122_400058927 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreter-spark:user:2C68U586U-shared_session
INFO [2018-08-10 00:11:57,445] ({pool-2-thread-996} SchedulerFactory.java[jobStarted]:109) - Job 20180413-191933_1545337614 started by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreter-spark:user:2C68U586U-shared_session
INFO [2018-08-10 00:11:57,527] ({pool-2-thread-996} NotebookServer.java[afterStatusChange]:2631) - Job 20180413-191933_1545337614 is finished successfully, status: FINISHED
INFO [2018-08-10 00:11:57,547] ({DefaultQuartzScheduler_Worker-47} Paragraph.java[execute]:343) - skip to run blank paragraph. 20180423-134725_1702290212
INFO [2018-08-10 00:11:57,547] ({DefaultQuartzScheduler_Worker-47} Notebook.java[execute]:947) - End schedule run note: 2C68U586U
INFO [2018-08-10 00:11:57,548] ({DefaultQuartzScheduler_Worker-47} ManagedInterpreterGroup.java[close]:100) - Close Session: shared_session for interpreter setting: spark
INFO [2018-08-10 00:11:57,553] ({pool-2-thread-996} VFSNotebookRepo.java[save]:196) - Saving note:2C68U586U

Third job status from FINISHED becomes ABORT

WARN [2018-08-10 00:11:57,555] ({DefaultQuartzScheduler_Worker-47} NotebookServer.java[afterStatusChange]:2633) - Job 20180413-191933_1545337614 is finished, status: ABORT, exception: null, result: %text 'sometext'
INFO [2018-08-10 00:11:57,577] ({pool-2-thread-996} SchedulerFactory.java[jobFinished]:115) - Job 20180413-191933_1545337614 finished by scheduler org.apache.zeppelin.interpreter.remote.RemoteInterpreter-spark:user:2C68U586U-shared_session
INFO [2018-08-10 00:11:57,585] ({DefaultQuartzScheduler_Worker-47} ManagedInterpreterGroup.java[close]:130) - Job paragraph_1523636373190_-1466164905 aborted
```

### What type of PR is it?
Bug Fix

### What is the Jira issue?
* Issue: https://issues.apache.org/jira/browse/ZEPPELIN-3704

### How should this be tested?
* CI failed on the 4th test https://travis-ci.org/TinkoffCreditSystems/zeppelin/builds/421046867 with:
```
Tests in error:
  PersonalizeActionsIT.testSimpleAction:152->AbstractZeppelinIT.waitForParagraph:73->AbstractZeppelinIT.pollingWait:99 » Timeout
  PersonalizeActionsIT.testGraphAction:194->AbstractZeppelinIT.pollingWait:99 » Timeout
  PersonalizeActionsIT.testDynamicFormAction:267->AbstractZeppelinIT.pollingWait:99 » Timeout

Tests run: 29, Failures: 0, Errors: 3, Skipped: 0
```
It seems that something went wrong with Travis.
* Tested in the TinkoffCreditSystems fork, new log:
```
INFO [2018-08-27 04:00:00,001] ({DefaultQuartzScheduler_Worker-30} Notebook.java[execute]:947) - Start schedule run note: 2DJUZ2HJX, cronExpr:"0 0 0/1 * * ?"
...
INFO [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} Notebook.java[execute]:949) - End schedule run note: 2DJUZ2HJX
INFO [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} ManagedInterpreterGroup.java[close]:100) - Close Session: shared_session for interpreter setting: spark
ERROR [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} RemoteScheduler.java[getJobsRunning]:138) - Tried to add paragraph_1532602460612_1917281840 to list of running jobs, but job status is FINISHED
ERROR [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} RemoteScheduler.java[getJobsRunning]:138) - Tried to add paragraph_1532602460620_1914203849 to list of running jobs, but job status is FINISHED
WARN [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkInterpreter
WARN [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkSqlInterpreter
WARN [2018-08-27 04:00:11,619] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.DepInterpreter
WARN [2018-08-27 04:00:11,627] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.IPySparkInterpreter
WARN [2018-08-27 04:00:11,653] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreter.java[close]:199) - close is called when RemoterInterpreter is not opened for org.apache.zeppelin.spark.SparkRInterpreter
INFO [2018-08-27 04:00:11,655] ({DefaultQuartzScheduler_Worker-30} ManagedInterpreterGroup.java[close]:105) - Remove this InterpreterGroup: spark:user:2DJUZ2HJX as all the sessions are closed
INFO [2018-08-27 04:00:11,655] ({DefaultQuartzScheduler_Worker-30} ManagedInterpreterGroup.java[close]:108) - Kill RemoteInterpreterProcess
INFO [2018-08-27 04:00:11,661] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreterManagedProcess.java[stop]:220) - Kill interpreter process
WARN [2018-08-27 04:00:14,188] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreterManagedProcess.java[stop]:230) - ignore the exception when shutting down
INFO [2018-08-27 04:00:14,191] ({DefaultQuartzScheduler_Worker-30} RemoteInterpreterManagedProcess.java[stop]:238) - Remote process terminated
```

### Screenshots (if appropriate)

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/TinkoffCreditSystems/zeppelin ZEPPELIN-3704

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/3165.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3165

----
commit 40e9ff1411f2e4c4db182522c990e2bee8c5fab2
Author: egorklimov <klim.electronicmail@...>
Date:   2018-08-13T08:50:42Z

    RemoteScheduler updated

----

---
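
For readers following the description above, here is a minimal sketch of the idea behind the fix: when collecting "running" jobs before closing a session, skip any job whose status is no longer RUNNING, so that a job that has already finished is not aborted afterwards. This is not the code from the commit. The class `RunningJobsSketch`, the `Status` enum, the `Job` type, and the methods `runningOnly` and `abortRunning` are all hypothetical names invented for illustration and do not correspond to Zeppelin's actual classes or signatures.

```java
import java.util.ArrayList;
import java.util.List;

public class RunningJobsSketch {

    // Hypothetical stand-in for a job status; not Zeppelin's actual Job.Status.
    enum Status { PENDING, RUNNING, FINISHED, ABORT }

    // Hypothetical minimal job type used only for this illustration.
    static final class Job {
        final String id;
        volatile Status status;

        Job(String id, Status status) {
            this.id = id;
            this.status = status;
        }
    }

    // Returns only the tracked jobs that are still RUNNING. Jobs that finished
    // while still sitting in the tracked list are skipped instead of reported.
    static List<Job> runningOnly(List<Job> tracked) {
        List<Job> running = new ArrayList<>();
        for (Job job : tracked) {
            if (job.status == Status.RUNNING) {
                running.add(job);
            }
        }
        return running;
    }

    // Marks only the still-running jobs as ABORT and returns how many were aborted.
    // Finished jobs keep their FINISHED status.
    static int abortRunning(List<Job> tracked) {
        int aborted = 0;
        for (Job job : runningOnly(tracked)) {
            job.status = Status.ABORT;
            aborted++;
        }
        return aborted;
    }

    public static void main(String[] args) {
        List<Job> tracked = new ArrayList<>();
        tracked.add(new Job("paragraph-a", Status.FINISHED)); // finished, but still tracked
        tracked.add(new Job("paragraph-b", Status.RUNNING));

        int aborted = abortRunning(tracked);
        System.out.println("aborted=" + aborted + ", first=" + tracked.get(0).status);
    }
}
```

In this sketch the finished job keeps its FINISHED status and only the job that is still running is aborted, which is the outcome the description says was missing when the session was closed after the scheduled run.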