juliuszsompolski commented on code in PR #43955:
URL: https://github.com/apache/spark/pull/43955#discussion_r1401923866
##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ExecutePlanResponseReattachableIterator.scala:
##########
@@ -101,7 +101,20 @@ class ExecutePlanResponseReattachableIterator(
// throw error on first iter.hasNext() or iter.next()
// Visible for testing.
private[connect] var iter:
Option[java.util.Iterator[proto.ExecutePlanResponse]] =
- Some(rawBlockingStub.executePlan(initialRequest))
+ Some(makeLazyIter(rawBlockingStub.executePlan(initialRequest)))
+
+ // Creates a request that contains the query and returns a stream of
`ExecutePlanResponse`.
+ // After upgrading gRPC from 1.56.0 to 1.59.3, it makes the first request
when
+ // the stream is created, but here the code here assumes that no request is
made before
+ // that, see also SPARK-46042
+ private def makeLazyIter(f: => java.util.Iterator[proto.ExecutePlanResponse])
+ : java.util.Iterator[proto.ExecutePlanResponse] = {
+ new java.util.Iterator[proto.ExecutePlanResponse] {
+ private lazy val internalIter = f
+ override def hasNext: Boolean = internalIter.hasNext
+ override def next(): proto.ExecutePlanResponse = internalIter.next
+ }
+ }
Review Comment:
I think it's an unnecessary hack to preserve a behaviour that wasn't really
desired in the first place.
As written above, I'd rather just embrace the change. The only public API
impact is on `Dataset.toLocalIterator` now eagerly submitting query to the
server.
There is an existing difference in Spark Connect, and non-Spark Connect
`toLocalIterator`.
In non Spark Connect, `Dataset.toLocalIterator` will proceed to lazily
execute the query, task by task (each task submitted as a separate Spark Job)
as the iterator is consumed and more data is needed; this will be blocking
every time a new task is needed, as it will only be triggered when the data
from it is needed.
In Spark Connect, the whole query will be submitted and executed just like
normally. The change to do it when `Dataset.toLocalIterator` is first called as
opposed when the iterator is opened for the first time is not significant
compared to the existing difference outlined above.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]