andreAmorimF opened a new pull request, #55689:
URL: https://github.com/apache/spark/pull/55689

   ### What changes were proposed in this pull request?
   
   Add `Dataset.getNumPartitions: Int` to the Spark Connect client (Scala and 
Python), matching the semantics of `df.rdd.getNumPartitions` in classic Spark.
   
   In classic Spark, users call `df.rdd.getNumPartitions` to get the number of 
partitions a DataFrame will produce during execution. Spark Connect's 
client-side `Dataset` does not expose `.rdd`, so this operation was 
unavailable. This PR adds `Dataset.getNumPartitions` directly on the Connect 
client `Dataset`.
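
For code that has to run against both classic Spark and Spark Connect, a small helper can hide the difference. The sketch below is illustrative only: `num_partitions` and the two stand-in classes are hypothetical names, not part of this PR, and the only assumption about the real API is that the new method is callable as `df.getNumPartitions()` as described above.

```python
def num_partitions(df):
    """Return the partition count of a DataFrame-like object.

    Prefers the DataFrame-level method described in this PR and falls back
    to the classic `df.rdd.getNumPartitions()` when that method is absent.
    """
    method = getattr(df, "getNumPartitions", None)
    if callable(method):
        return method()
    return df.rdd.getNumPartitions()


# Hypothetical stand-ins so the sketch runs without a Spark installation.
class FakeRDD:
    def getNumPartitions(self):
        return 8


class FakeClassicDataFrame:
    # Classic DataFrame: partition count is only reachable through `.rdd`.
    rdd = FakeRDD()


class FakeConnectDataFrame:
    # Connect DataFrame: no `.rdd`, but the new method is present.
    def getNumPartitions(self):
        return 4


classic_count = num_partitions(FakeClassicDataFrame())
connect_count = num_partitions(FakeConnectDataFrame())
```

Checking for the new method first matters because the Connect client does not expose `.rdd`, so a fallback that touched `.rdd` first would fail there.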
   
   **Design choices:**
   - **Method name**: `getNumPartitions` — matches classic RDD API, aids 
migration
   - **Placement**: directly on `Dataset` — `df.getNumPartitions`
   - **Server implementation**: `executedPlan.execute().getNumPartitions` — 
triggers physical planning (file splits, bucket assignments, etc.) and RDD 
partition construction, but no data scan. This is the same work classic Spark 
does for `rdd.getNumPartitions`
   - **Protocol**: new `AnalyzePlan` case (field 19 in request, 17 in response) 
— consistent with `isLocal`, `isStreaming`, `inputFiles`
   
   `outputPartitioning.numPartitions` was considered as a lighter alternative 
but rejected: the default `SparkPlan.outputPartitioning` returns 
`UnknownPartitioning(0)` for any operator that does not override it (including 
`ExistingRDD`, `Expand`, full-outer `SortMergeOuterJoin`, and all third-party 
operators), which would silently return `0` in those cases.
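
The failure mode described above can be shown with a toy model. This is not Spark code: `ToyPlan`, `ToyRepartition`, `metadata_count`, and `executed_count` are hypothetical names that only mimic the shape of the problem, where a default partitioning of `UnknownPartitioning(0)` reports zero partitions while executing the plan reveals the real count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnknownPartitioning:
    num_partitions: int


class ToyPlan:
    """Toy stand-in for a physical operator (not Spark's SparkPlan)."""

    def __init__(self, partitions: List[list]):
        self._partitions = partitions

    def output_partitioning(self) -> UnknownPartitioning:
        # Default for operators that do not override it: reports 0.
        return UnknownPartitioning(0)

    def execute(self) -> List[list]:
        # Toy analogue of building the RDD: returns the actual partitions.
        return self._partitions


class ToyRepartition(ToyPlan):
    """Operator that does override its partitioning metadata."""

    def output_partitioning(self) -> UnknownPartitioning:
        return UnknownPartitioning(len(self._partitions))


def metadata_count(plan: ToyPlan) -> int:
    # The lighter alternative: trust the declared partitioning.
    return plan.output_partitioning().num_partitions


def executed_count(plan: ToyPlan) -> int:
    # The chosen approach: construct the partitions and count them.
    return len(plan.execute())


plain = ToyPlan([[1], [2], [3]])
repartitioned = ToyRepartition([[], [], [], []])
```

In this model `metadata_count(plain)` is `0` even though the plan has three partitions, while `executed_count` is correct for both operators.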
   
   ### Why are the changes needed?
   
   `df.rdd.getNumPartitions` is a commonly used operation in classic Spark for 
understanding physical partitioning without triggering data scanning. Spark 
Connect clients have no equivalent, making it harder to migrate workloads from 
classic Spark to Spark Connect.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes — adds a new method `getNumPartitions(): Int` on `Dataset` in both the 
Scala and Python Spark Connect clients.
   
   ### How was this patch tested?
   
   - **Server unit test**: Added assertion to `SparkConnectServiceSuite` ("Test 
schema in analyze response") that validates the full server-side handler path 
using an embedded SparkSession: `repartition(4).getNumPartitions === 4`.
   - **E2E test**: Added assertions to `ClientE2ETestSuite` ("Dataset 
inspection"): `df.repartition(4).getNumPartitions === 4` and 
`df.coalesce(1).getNumPartitions === 1`. These exercise the full client → gRPC 
→ server → response round-trip.
   - **Python test**: Added `test_get_num_partitions` to 
`test_connect_basic.py`.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes, co-authored with Claude (Anthropic).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

