[GitHub] [spark] sunchao commented on a change in pull request #35613: [SPARK-38273][SQL] `decodeUnsafeRows`'s iterators should close underlying input streams

GitBox Tue, 22 Feb 2022 20:17:09 -0800


sunchao commented on a change in pull request #35613:
URL: https://github.com/apache/spark/pull/35613#discussion_r812546706




##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala
##########
@@ -384,17 +385,31 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
     val bis = new ByteArrayInputStream(bytes)
     val ins = new DataInputStream(codec.compressedInputStream(bis))
 
-    new Iterator[InternalRow] {
+    new NextIterator[InternalRow] {
+      Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => closeIfNeeded()))
       private var sizeOfNextRow = ins.readInt()

Review comment:
   I think one problem is that `Dataset.toLocalIterator` indirectly uses `decodeUnsafeRows`:
   ```scala
   // Dataset
     def toLocalIterator(): java.util.Iterator[T] = {
       withAction("toLocalIterator", queryExecution) { plan =>
         val fromRow = resolvedEnc.createDeserializer()
         plan.executeToIterator().map(fromRow).asJava
       }
     }
   
   // SparkPlan
     def executeToIterator(): Iterator[InternalRow] = {
       getByteArrayRdd().map(_._2).toLocalIterator.flatMap(decodeUnsafeRows)
     }
   ```
   Since the iterator is handed to the caller once `Dataset.toLocalIterator` returns, Spark has no way to know how it will be used or whether it will be fully drained. It therefore seems impossible to know when to call `close` on the input stream.
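
The concern above can be reproduced without Spark. The following is a minimal sketch, not the PR's code: `TrackingStream`, `encode`, `decode`, and `demo` are hypothetical names, and the length-prefixed format ending in a `-1` marker is an assumption made for illustration. It shows that an iterator which closes its stream only on exhaustion leaves the stream open when the caller stops early.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object CloseOnExhaustionSketch {
  // Hypothetical stand-in for the decompression stream: it only records close().
  class TrackingStream(bytes: Array[Byte]) extends ByteArrayInputStream(bytes) {
    var closed = false
    override def close(): Unit = { closed = true; super.close() }
  }

  // Assumed toy format: [size][payload] per row, then -1 as the end marker.
  def encode(rows: Seq[Array[Byte]]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val out = new DataOutputStream(bos)
    rows.foreach { r => out.writeInt(r.length); out.write(r) }
    out.writeInt(-1)
    out.flush()
    bos.toByteArray
  }

  // Closes the underlying stream only once the end marker has been read.
  def decode(stream: TrackingStream): Iterator[Array[Byte]] = new Iterator[Array[Byte]] {
    private val ins = new DataInputStream(stream)
    private var sizeOfNextRow = ins.readInt()
    if (sizeOfNextRow < 0) ins.close() // empty input: close immediately

    override def hasNext: Boolean = sizeOfNextRow >= 0

    override def next(): Array[Byte] = {
      val row = new Array[Byte](sizeOfNextRow)
      ins.readFully(row)
      sizeOfNextRow = ins.readInt()
      if (sizeOfNextRow < 0) ins.close() // reached the end marker
      row
    }
  }

  // Returns (closed after full drain, closed after reading one row).
  def demo(): (Boolean, Boolean) = {
    val rows = Seq("a", "bb", "ccc").map(_.getBytes("UTF-8"))

    val drained = new TrackingStream(encode(rows))
    decode(drained).foreach(_ => ()) // consume every row

    val partial = new TrackingStream(encode(rows))
    decode(partial).next() // the caller reads one row and walks away

    (drained.closed, partial.closed)
  }
}
```

In this sketch `demo()` reports the fully drained stream as closed and the partially consumed one as still open. That is the situation a caller of `toLocalIterator` can create, and no task completion listener exists to cover it when `TaskContext.get()` returns `null`.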
   
   
   




