[GitHub] advancedxy commented on a change in pull request #23638: [SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished

GitBox Mon, 28 Jan 2019 19:30:07 -0800

advancedxy commented on a change in pull request #23638: [SPARK-26713][CORE] Interrupt pipe IO threads in PipedRDD when task is finished
URL: https://github.com/apache/spark/pull/23638#discussion_r251680682


 ##########
 File path: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
 ##########
 @@ -141,7 +141,14 @@ final class ShuffleBlockFetcherIterator(
 
   /**
    * Whether the iterator is still active. If isZombie is true, the callback interface will no
 -   * longer place fetched blocks into [[results]].
 +   * longer place fetched blocks into [[results]] and the iterator is marked as fully consumed.
 +   *
 +   * When the iterator is inactive, [[hasNext]] and [[next]] calls will honor that as there are
 +   * cases the iterator is still being consumed. For example, ShuffledRDD + PipedRDD if the
 +   * subprocess command is failed. The task will be marked as failed, then the iterator will be
 +   * cleaned up at task completion, the [[next]] call (called in the stdin writer thread of
 +   * PipedRDD if not exited yet) may hang at [[results.take]]. The defensive check in [[hasNext]]
 +   * and [[next]] reduces the possibility of such race conditions.
 
 Review comment:
   > And why can't `PipelinedRDD` stop consuming input? I think it's better to fix the solo consumer side, instead of fixing different kinds of producer sides.
   
   `PipedRDD` does stop consuming input in this PR. For the `ShuffledRDD` + `PipedRDD` case alone, the fix in `PipedRDD` is sufficient. However, I noticed that the iterator continuing to produce data is also part of the cause, so I made the corresponding changes on that side as well.
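
A minimal sketch of the consumer-side idea (interrupting the IO thread once the task has finished), assuming a plain blocking queue as the upstream source. `InterruptWriterSketch`, `runDemo`, and the thread name are hypothetical names for illustration; this is not the actual `PipedRDD` code and does not use Spark's task-completion API.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}

object InterruptWriterSketch {
  // Returns true if the writer thread exits after being interrupted.
  def runDemo(): Boolean = {
    // Never filled in this demo, so take() would block indefinitely.
    val input = new LinkedBlockingQueue[String]()
    val exited = new CountDownLatch(1)

    val writer = new Thread("stdin-writer-sketch") {
      override def run(): Unit = {
        try {
          while (true) {
            // Blocks like an upstream iterator that is waiting for data.
            val record = input.take()
            // A real writer would send `record` to the subprocess stdin here.
          }
        } catch {
          case _: InterruptedException => // task finished: stop consuming input
        } finally {
          exited.countDown()
        }
      }
    }
    writer.setDaemon(true)
    writer.start()

    // Stand-in for a task-completion hook: interrupt the IO thread.
    writer.interrupt()
    exited.await(5, TimeUnit.SECONDS)
  }
}
```

The interrupt only helps when the blocking call is interruptible, as `LinkedBlockingQueue.take` is.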
   
   > When a task finishes, do we really need to guarantee all the iterators stop producing data? I agree it's better, but I'm afraid it's too much effort to guarantee it. Not only shuffle reader, we also need to fix the sort iterator, aggregate iterator and so on.
   
   I think we can try our best to guarantee that. If it turns out to be too much effort, we could stop or try a different approach.
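
A minimal sketch of the producer-side defensive check described in the doc comment above, assuming a simplified iterator backed by a blocking queue. `ZombieAwareIterator`, `put`, and `cleanup` are hypothetical names; the real `ShuffleBlockFetcherIterator` is considerably more involved.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical simplification; not the real ShuffleBlockFetcherIterator.
final class ZombieAwareIterator[T](total: Int) extends Iterator[T] {
  private val results = new LinkedBlockingQueue[T]()
  private var consumed = 0
  // Set at task completion; read by the consumer thread.
  @volatile private var isZombie = false

  // Callback side: drop fetched items once the iterator is inactive.
  def put(item: T): Unit = if (!isZombie) results.put(item)

  // Called at task completion to mark the iterator as inactive.
  def cleanup(): Unit = { isZombie = true }

  // An inactive iterator reports itself as fully consumed.
  override def hasNext: Boolean = !isZombie && consumed < total

  override def next(): T = {
    // Defensive check: fail fast instead of blocking on take().
    if (!hasNext) throw new NoSuchElementException("iterator is inactive or fully consumed")
    consumed += 1
    results.take()
  }
}
```

This only narrows the window: a thread that is already blocked in `take()` when `cleanup()` runs is not woken up by the flag, which is why the check reduces the race rather than removing it.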
