Re: [PR] [SPARK-56726][CONNECT] Add Dataset.getNumPartitions to Spark Connect client [spark]

via GitHub Wed, 20 May 2026 06:38:25 -0700


hvanhovell commented on PR #55689:
URL: https://github.com/apache/spark/pull/55689#issuecomment-4498959540


   @andreAmorimF I don't think this should be added as a Spark Connect API for 
a couple of reasons:
   - Spark Connect is supposed to be engine agnostic. Leaking execution details 
into the API is not really desirable.
   - AFAICT the only reason why this would be needed is because you want to 
modify parallelism at some stage in the plan. At the end of the day this should 
be an engine problem, and we should try to fix it there.
   - You argue that there won't be any datascanning. That is unfortunately not 
true. With Adaptive Query Execution enabled any query that contains shuffles 
will actually materialize all shuffles in the tree when you call 
getNumPartitions. If you combine this with Connects' lazy nature (we rebuild 
the entire plan), you are effectively going to rerun the same query twice.
   - Connects' lazy nature makes this feature not as accurate as you might 
expect. Since the plan is rebuild from scratch, the same dataframes can have 
different plans (with different parallelism), when things in the underlying 
data change or when the session state change (e.g. different conf). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56726][CONNECT] Add Dataset.getNumPartitions to Spark Connect client [spark]

Reply via email to