aokolnychyi commented on pull request #32921: URL: https://github.com/apache/spark/pull/32921#issuecomment-867141618
I updated the PR to address the point about output partitioning. There are still a few comments I need to address (will do soon).

I think allowing data sources to change the output partitioning, as long as it does NOT introduce a shuffle, is a good idea. After thinking more about it, though, passing the required distribution to `filter` to achieve that would complicate the underlying logic in connectors and make the API quite hard to use.

Instead, Spark can require connectors to always keep the original distribution during runtime filtering and pass a flag indicating whether the number of tasks can be changed. Essentially, rather than passing the required distribution and letting connectors interpret it, we just tell them whether it is safe to change the number of tasks.

Let me know what everybody thinks. The current logic never introduces new shuffles and is very close to the existing approach for v1 tables.
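
To make the flag-based proposal concrete, here is a minimal, self-contained sketch of how a connector might react to such a flag. It is an illustration only: `RuntimeFilterSketch`, `Task`, the `filter` signature, and the `canChangeNumTasks` parameter are all hypothetical names and are not taken from the PR or from any Spark API. The sketch assumes one task per partition key and models a runtime filter as a set of allowed keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names are real Spark APIs.
public class RuntimeFilterSketch {

    // One planned task that reads the rows of a single partition key.
    public static final class Task {
        public final String partitionKey;
        public final boolean empty; // true if the task is kept only as a placeholder

        public Task(String partitionKey, boolean empty) {
            this.partitionKey = partitionKey;
            this.empty = empty;
        }
    }

    // Applies a runtime filter (modeled as a set of allowed partition keys).
    // The original distribution is always kept: tasks are never merged or re-keyed.
    // If canChangeNumTasks is false, non-matching tasks become empty placeholders
    // so the task count stays the same; otherwise they are dropped.
    public static List<Task> filter(List<Task> tasks,
                                    Set<String> allowedKeys,
                                    boolean canChangeNumTasks) {
        List<Task> result = new ArrayList<>();
        for (Task task : tasks) {
            if (allowedKeys.contains(task.partitionKey)) {
                result.add(task);
            } else if (!canChangeNumTasks) {
                result.add(new Task(task.partitionKey, true));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Task> tasks = Arrays.asList(
            new Task("a", false), new Task("b", false), new Task("c", false));
        Set<String> allowed = new HashSet<>(Arrays.asList("a", "c"));

        System.out.println(filter(tasks, allowed, true).size());  // prints 2
        System.out.println(filter(tasks, allowed, false).size()); // prints 3
    }
}
```

In this sketch the connector never has to interpret a required distribution. It only decides between dropping filtered-out tasks and keeping them as empty placeholders, which is the simplification the comment argues for.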
