ankurdave commented on PR #36425: URL: https://github.com/apache/spark/pull/36425#issuecomment-1116249393
From talking to @sadikovi, it sounds like the use-after-free that caused this crash does in fact occur in the Python writer thread, not the main task thread. And since `RDD#isEmpty()` is implemented using `limit(1)`, this is very similar to the situation described in https://github.com/apache/spark/pull/34245. The main difference appears to be the presence of a group-by with codegen enabled.

Given that, the question is why https://github.com/apache/spark/pull/34245 was not sufficient to fix this. I'm guessing [the task completion listener that frees the off-heap memory](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L489) is being registered in the wrong order relative to the `BasePythonRunner` task completion listener.

Anyway, even if that ordering were fixed, I think we would still need the fix in this PR for performance reasons: otherwise the writer thread could read an arbitrary amount of data before checking its interrupt status.
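For context on why registration order matters here: Spark's `TaskContext` invokes task completion listeners in *reverse* order of registration (LIFO), so whichever listener is registered last runs first. Below is a minimal toy sketch of that semantics; `ToyTaskContext` and `ListenerOrderDemo` are invented names for illustration, not Spark classes, and the registration order shown is the hypothetical bad ordering guessed at above, not a confirmed trace.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for Spark's TaskContextImpl, which invokes completion
// listeners in reverse registration order (LIFO).
class ToyTaskContext {
  private val listeners = ArrayBuffer.empty[() => Unit]

  def addTaskCompletionListener(f: () => Unit): Unit = listeners += f

  def markTaskCompleted(): Unit = listeners.reverseIterator.foreach(_.apply())
}

object ListenerOrderDemo {
  // Returns the order in which the two listeners actually fired.
  def run(): Seq[String] = {
    val fired = ArrayBuffer.empty[String]
    val ctx = new ToyTaskContext

    // Hypothetical bad ordering: the Python runner's listener (which should
    // stop the writer thread) happens to register first...
    ctx.addTaskCompletionListener(() => fired += "stop Python writer thread")
    // ...and the aggregate's free-memory listener registers second.
    ctx.addTaskCompletionListener(() => fired += "free off-heap hash map")

    ctx.markTaskCompleted()
    fired.toSeq
  }

  def main(args: Array[String]): Unit =
    // LIFO: the off-heap memory is freed BEFORE the writer thread is
    // stopped, leaving a use-after-free window for the writer thread.
    println(run().mkString(" -> "))
}
```

With that registration order, LIFO invocation frees the hash map first and only then stops the writer thread, which matches the crash pattern; registering the listeners the other way around would stop the writer before freeing the memory.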
