Re: [PR] [SPARK-46042][CONNECT] Reenable a `releaseSession` test case in SparkConnectServiceE2ESuite [spark]

via GitHub Wed, 22 Nov 2023 03:44:56 -0800


juliuszsompolski commented on code in PR #43955:
URL: https://github.com/apache/spark/pull/43955#discussion_r1401923866



##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala:
##########
@@ -101,7 +101,20 @@ class ExecutePlanResponseReattachableIterator(
   // throw error on first iter.hasNext() or iter.next()
   // Visible for testing.
   private[connect] var iter: 
Option[java.util.Iterator[proto.ExecutePlanResponse]] =
-    Some(rawBlockingStub.executePlan(initialRequest))
+    Some(makeLazyIter(rawBlockingStub.executePlan(initialRequest)))
+
+  // Creates a request that contains the query and returns a stream of 
`ExecutePlanResponse`.
+  // After upgrading gRPC from 1.56.0 to 1.59.3, it makes the first request 
when
+  // the stream is created, but here the code here assumes that no request is 
made before
+  // that, see also SPARK-46042
+  private def makeLazyIter(f: => java.util.Iterator[proto.ExecutePlanResponse])
+      : java.util.Iterator[proto.ExecutePlanResponse] = {
+    new java.util.Iterator[proto.ExecutePlanResponse] {
+      private lazy val internalIter = f
+      override def hasNext: Boolean = internalIter.hasNext
+      override def next(): proto.ExecutePlanResponse = internalIter.next
+    }
+  }

Review Comment:
   I think it's an unnecessary hack to preserve a behaviour that wasn't really 
desired in the first place.
   As written above, I'd rather just embrace the change. The only public API 
impact is on `Dataset.toLocalIterator` now eagerly submitting query to the 
server.
   
   There is an existing difference in Spark Connect, and non-Spark Connect 
`toLocalIterator`.
   In non Spark Connect, `Dataset.toLocalIterator` will proceed to lazily 
execute the query, task by task (each task submitted as a separate Spark Job) 
as the iterator is consumed and more data is needed; this will be blocking 
every time a new task is needed, as it will only be triggered when the data 
from it is needed.
   In Spark Connect, the whole query will be submitted and executed just like 
normally. The change to do it when `Dataset.toLocalIterator` is first called as 
opposed when the iterator is opened for the first time is not significant 
compared to the existing difference outlined above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-46042][CONNECT] Reenable a `releaseSession` test case in SparkConnectServiceE2ESuite [spark]

Reply via email to