[GitHub] [spark] aokolnychyi commented on pull request #32921: [WIP][SPARK-35779][SQL] Dynamic filtering for Data Source V2

GitBox Wed, 23 Jun 2021 13:39:40 -0700


aokolnychyi commented on pull request #32921:
URL: https://github.com/apache/spark/pull/32921#issuecomment-867141618



   I updated the PR to address the point about output partitioning. There 
are still a few comments I need to address (will do soon).
   
   I think allowing data sources to change the output partitioning, as long as 
it does NOT introduce a shuffle, is a good idea. After thinking more about it, 
though, passing the required distribution to `filter` to achieve that would 
complicate the underlying logic in connectors, and such an API would be quite 
hard to use. Instead, Spark can require connectors to always keep the original 
distribution during runtime filtering and pass a flag indicating whether the 
number of tasks can be changed. Essentially, instead of passing the required 
distribution and letting connectors interpret it, we just tell them whether it 
is safe to change the number of tasks.
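
   A rough sketch of the flag-based idea, to make the contract concrete. 
Everything here is hypothetical and is not the actual Spark connector API: the 
class name `RuntimeFilterSketch`, the parameter `canChangeNumTasks`, and the 
modelling of a task as a list of partition keys are assumptions made only for 
illustration. The connector drops data that cannot match, but removes whole 
tasks only when the flag says that is safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; none of these names come from Spark.
final class RuntimeFilterSketch {

    // Each task is modelled as the list of partition keys it would read.
    // allowedKeys stands in for the values produced by the runtime filter.
    static List<List<String>> filter(
            List<List<String>> tasks,
            Set<String> allowedKeys,
            boolean canChangeNumTasks) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> task : tasks) {
            // Keep only the keys that can still match the runtime filter.
            List<String> kept = new ArrayList<>();
            for (String key : task) {
                if (allowedKeys.contains(key)) {
                    kept.add(key);
                }
            }
            // An empty task is dropped only when changing the task count is safe;
            // otherwise it stays (empty) so the original task layout is preserved.
            if (!kept.isEmpty() || !canChangeNumTasks) {
                result.add(kept);
            }
        }
        return result;
    }
}
```

   With the flag set to false the number of tasks is unchanged and some tasks 
may simply become empty; with the flag set to true the connector is free to 
return fewer tasks. In neither case does the connector have to interpret a 
required distribution.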
   
   Let me know what everybody thinks. The current logic will never introduce 
new shuffles and is very close to the existing approach for v1 tables.


