juliuszsompolski commented on PR #42331:
URL: https://github.com/apache/spark/pull/42331#issuecomment-1665727848

   > I am just wondering if at this point using the GRPC iterators is the 
easiest? Would it be easier to use a stream observer instead?
   
   That would imply changing the whole execution model in the client from pull 
based to push based...
   While the only place where we expose a pull based iterator in an external facing API is Dataset.toLocalIterator, all the internal implementation is pull based currently, so practically the whole client would need to be refactored...
   Maybe push based would make it easier to ensure that our internal consumer pushes everything out... but it would also introduce complications, e.g. flow control of the push, which could ultimately again lead to blocking and not consuming... I see this as quite complicated as well.
   In every place except Dataset.toLocalIterator we control the full lifecycle of the pull based iterator. With the CloseableIterator we can now ensure that it gets closed...
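   To illustrate the pattern being discussed, here is a minimal, hypothetical sketch (not the actual Spark Connect client code; `CloseableIterator`, `RangeIterator`, and `sum` are illustrative names) of a pull based iterator whose consumer controls the full lifecycle and guarantees closing even when consumption fails:
   
   ```java
   import java.util.Iterator;
   import java.util.NoSuchElementException;
   
   // Hypothetical sketch: an iterator that must be closed, plus a consumer
   // that guarantees close() runs whether iteration finishes or fails.
   interface CloseableIterator<E> extends Iterator<E>, AutoCloseable {
       @Override
       void close(); // narrowed so callers need not handle checked exceptions
   }
   
   public class CloseableIteratorDemo {
       // Toy iterator over 0..n-1 that records whether it was closed.
       static class RangeIterator implements CloseableIterator<Integer> {
           private final int n;
           private int next = 0;
           boolean closed = false;
   
           RangeIterator(int n) { this.n = n; }
   
           @Override public boolean hasNext() { return !closed && next < n; }
   
           @Override public Integer next() {
               if (!hasNext()) throw new NoSuchElementException();
               return next++;
           }
   
           @Override public void close() { closed = true; }
       }
   
       // The consumer owns the lifecycle: try-with-resources ensures close()
       // is called on normal completion and on any thrown error.
       static int sum(RangeIterator it) {
           try (RangeIterator managed = it) {
               int total = 0;
               while (managed.hasNext()) total += managed.next();
               return total;
           }
       }
   
       public static void main(String[] args) {
           RangeIterator it = new RangeIterator(5);
           System.out.println(sum(it));   // 10
           System.out.println(it.closed); // true
       }
   }
   ```
   
   The point of the sketch: as long as every internal consumer wraps iteration this way, the pull based model closes resources deterministically; Dataset.toLocalIterator is the one place where the caller, not the client, owns that lifecycle.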
   There is however indeed a possible followup. We handle consuming the iterator, and we handle closing the ExecutePlanResponseReattachableIterator after errors from the server. But we should also audit whether we could do better closing when other errors, including interrupts, are thrown. I raised https://issues.apache.org/jira/browse/SPARK-44676 for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

