hvanhovell commented on PR #55689: URL: https://github.com/apache/spark/pull/55689#issuecomment-4498959540
@andreAmorimF I don't think this should be added as a Spark Connect API for a couple of reasons: - Spark Connect is supposed to be engine agnostic. Leaking execution details into the API is not really desirable. - AFAICT the only reason why this would be needed is because you want to modify parallelism at some stage in the plan. At the end of the day this should be an engine problem, and we should try to fix it there. - You argue that there won't be any datascanning. That is unfortunately not true. With Adaptive Query Execution enabled any query that contains shuffles will actually materialize all shuffles in the tree when you call getNumPartitions. If you combine this with Connects' lazy nature (we rebuild the entire plan), you are effectively going to rerun the same query twice. - Connects' lazy nature makes this feature not as accurate as you might expect. Since the plan is rebuild from scratch, the same dataframes can have different plans (with different parallelism), when things in the underlying data change or when the session state change (e.g. different conf). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
