emanhthangngot opened a new pull request, #55899:
URL: https://github.com/apache/spark/pull/55899

   ### What changes were proposed in this pull request?
   
   This PR optimizes pandas-on-Spark `DataFrame.diff(axis=0)` and 
`Series.diff()` to avoid using an unpartitioned Spark Window.
   
   The new implementation range-partitions the data by the natural order 
column, computes pandas `diff()` within each Spark partition, and exchanges 
only the boundary rows needed to keep results correct across partition 
boundaries. The existing grouped `diff()` path is unchanged.
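
The boundary-row idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not the code in this PR: `local_diff` and `partitioned_diff` are hypothetical names, the partitions are modeled as ordered Python lists, and `None` stands in for a null or missing result. It assumes each partition only needs the `abs(periods)` rows adjacent to it, which may come from more than one neighboring partition when partitions are small or empty.

```python
from typing import List, Optional

Value = Optional[float]


def local_diff(values: List[Value], periods: int) -> List[Value]:
    """Positional diff over a single list; None marks a missing result."""
    n = len(values)
    out: List[Value] = []
    for i in range(n):
        j = i - periods  # position of the row to subtract
        if 0 <= j < n and values[i] is not None and values[j] is not None:
            out.append(values[i] - values[j])
        else:
            out.append(None)
    return out


def partitioned_diff(partitions: List[List[Value]], periods: int = 1) -> List[List[Value]]:
    """Diff each partition locally, borrowing only the boundary rows it needs."""
    k = abs(periods)
    result: List[List[Value]] = []
    for idx, chunk in enumerate(partitions):
        if periods > 0:
            # Boundary rows: the last k rows that precede this partition.
            before = [v for part in partitions[:idx] for v in part][-k:]
            diffed = local_diff(before + chunk, periods)[len(before):]
        elif periods < 0:
            # Boundary rows: the first k rows that follow this partition.
            after = [v for part in partitions[idx + 1:] for v in part][:k]
            diffed = local_diff(chunk + after, periods)[:len(chunk)]
        else:
            # periods == 0 needs no neighboring rows.
            diffed = local_diff(chunk, 0)
        result.append(diffed)
    return result


parts = [[1.0, 4.0], [9.0], [], [16.0, None, 36.0]]
print(partitioned_diff(parts, periods=1))
# [[None, 3.0], [5.0], [], [7.0, None, None]]
```

Concatenating the per-partition outputs gives the same values as running `local_diff` once over the flattened data, which is the property the cross-partition boundary tests are meant to check. In this sketch the boundary rows are sliced from an in-memory list; in a distributed setting only those few rows per partition would need to move between partitions.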
   
   Additional tests cover:
   - absence of a Window in the analyzed plan for `DataFrame.diff()`
   - empty DataFrames
   - MultiIndex rows
   - null values
   - single-partition execution
   - zero, negative, and large periods
   - cross-partition boundary rows
   - `Series.diff()` delegation
   
   ### Why are the changes needed?
   
   `DataFrame.diff(axis=0)` currently delegates to `Series._diff()` without a 
partition specification. This creates a Spark Window over the whole DataFrame 
ordered by the natural order column, which can force all data into a single 
partition and cause scaling issues for large datasets.
   
   This change removes that unpartitioned Window from the 
`DataFrame.diff(axis=0)` / `Series.diff()` path while preserving 
pandas-compatible positional diff semantics, including rows at partition 
boundaries.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, in execution behavior only. `DataFrame.diff(axis=0)` and 
`Series.diff()` no longer go through the unpartitioned Window execution path. 
Result values are intended to be unchanged.
   
   ### How was this patch tested?
   
   Ran:
   
   ```
   python/run-tests --python-executables .venv/bin/python --testnames pyspark.pandas.tests.computation.test_compute
   ```
   
   The test was run from a temporary path without spaces because the local 
checkout path contains spaces and Spark's Java launcher fails to start from 
that path.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Codex (GPT-5)
   
   Codex was used to help inspect the existing implementation, identify the 
unpartitioned Window path, refine the patch, and prepare tests. The final 
changes were reviewed and validated by the author.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

