Hi!

We found the reason why this error is happening. It seems to be related
to the fix
<https://github.com/apache/zeppelin/commit/9f22db91c279b7daf6a13b2d805a874074b070fd>
for ZEPPELIN-2075
<https://issues.apache.org/jira/browse/ZEPPELIN-2075>.

That fix causes the PySpark jobs of *all users* to be canceled whenever
any one user cancels his own PySpark job.

When a PySpark job is cancelled, PySparkInterpreter.interrupt() is
invoked, which raises SIGINT in the Python process. The SIGINT handler
registered in PySpark's context.py then cancels every job in the shared
SparkContext:

context.py:

# create a signal handler which would be invoked on receiving SIGINT
def signal_handler(signal, frame):
    self.cancelAllJobs()
    raise KeyboardInterrupt()
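
For what it's worth, here is a minimal, self-contained sketch of the
scoped alternative: canceling only the interrupted user's job group via
cancelJobGroup() instead of cancelAllJobs(). The FakeSparkContext stub
and scoped_handler are illustrative assumptions, not Zeppelin or Spark
code; only the cancelJobGroup()/cancelAllJobs() names and the
"zeppelin-[notebook-id]-[paragraph-id]" group format come from this
thread:

```python
import signal

# Stand-in for a SparkContext, just to contrast the two cancellation
# calls. (FakeSparkContext is a fake; a real fix would act on the real
# SparkContext inside the interpreter.)
class FakeSparkContext:
    def __init__(self):
        self.cancelled_groups = []
        self.cancelled_all = False

    def cancelAllJobs(self):
        # What the current handler does: cancels every user's jobs.
        self.cancelled_all = True

    def cancelJobGroup(self, group):
        # A scoped alternative: cancels only one job group.
        self.cancelled_groups.append(group)

sc = FakeSparkContext()
current_group = "zeppelin-[notebook-id]-[paragraph-id]"

def scoped_handler(signum, frame):
    # Cancel only the interrupted user's job group, not the whole context.
    sc.cancelJobGroup(current_group)
    raise KeyboardInterrupt()

signal.signal(signal.SIGINT, scoped_handler)

# Simulate a SIGINT delivered while this user's paragraph is running.
try:
    scoped_handler(signal.SIGINT, None)
except KeyboardInterrupt:
    pass

print(sc.cancelled_groups)  # only this user's job group was cancelled
print(sc.cancelled_all)     # other users' jobs were left untouched
```

Of course, a real fix would need the handler to know which job group the
interrupted paragraph belongs to; this only illustrates the difference in
scope between the two calls.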

Is this a Zeppelin bug?

Thank you.


> 2018-06-12 9:26 GMT-05:00 Jhon Anderson Cardenas Diaz <
> jhonderson2...@gmail.com>:
>
>> Hi! I have the 0.8.0 version, from September 2017.
>>
>> 2018-06-12 4:48 GMT-05:00 Jianfeng (Jeff) Zhang <jzh...@hortonworks.com>:
>>
>>>
>>> Which version do you use?
>>>
>>>
>>> Best Regard,
>>> Jeff Zhang
>>>
>>>
>>> From: Jhon Anderson Cardenas Diaz <jhonderson2...@gmail.com>
>>> Reply-To: "us...@zeppelin.apache.org" <us...@zeppelin.apache.org>
>>> Date: Friday, June 8, 2018 at 11:08 PM
>>> To: "us...@zeppelin.apache.org" <us...@zeppelin.apache.org>,
>>> "dev@zeppelin.apache.org" <dev@zeppelin.apache.org>
>>> Subject: All PySpark jobs are canceled when one user cancel his PySpark
>>> paragraph (job)
>>>
>>> Dear community,
>>>
>>> Currently we are having problems with multiple users running paragraphs
>>> associated with pyspark jobs.
>>>
>>> The problem is that if a user aborts/cancels his PySpark paragraph
>>> (job), the active PySpark jobs of the other users are canceled too.
>>>
>>> Going into detail, I've seen that when you cancel a user's job this
>>> method is invoked (which is fine):
>>>
>>> sc.cancelJobGroup("zeppelin-[notebook-id]-[paragraph-id]")
>>>
>>> But, for some reason unknown to me, this method is also invoked:
>>>
>>> sc.cancelAllJobs()
>>>
>>> We infer this from the log trace that appears in the other users'
>>> jobs:
>>>
>>> Py4JJavaError: An error occurred while calling o885.count.
>>> : org.apache.spark.SparkException: Job 461 cancelled as part of cancellation of all jobs
>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>> at org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:1375)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:721)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:721)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:721)
>>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>>> at org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:721)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1628)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
>>> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>> at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
>>> at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
>>> at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
>>> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>>> at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
>>> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
>>> at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
>>> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2420)
>>> at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2419)
>>> at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2801)
>>> at org.apache.spark.sql.Dataset.count(Dataset.scala:2419)
>>> at sun.reflect.GeneratedMethodAccessor120.invoke(Unknown Source)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>> at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>> (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError('An error occurred while calling o885.count.\n', JavaObject id=o886), <traceback object at 0x7f9e669ae588>)
>>>
>>> Any idea of why this could be happening?
>>>
>>> (I have 0.8.0 version from September 2017)
>>>
>>> Thank you!
>>>
>>
>>
>
