tonghuaroot opened a new pull request, #56359:
URL: https://github.com/apache/spark/pull/56359

   ### What changes were proposed in this pull request?
   
   This PR enables `DataFrame.combine` for pandas-on-Spark through the
   `compute.pandas_fallback` path. Previously, `combine` was declared via
   `_unsupported_function`, so it always raised `PandasNotImplementedError`,
   even with the fallback option enabled. The change adds a
   `_combine_fallback` method to `pyspark.pandas.frame.DataFrame`,
   mirroring the existing `_asof_fallback` / `_set_axis_fallback` sibling
   methods, so that `__getattr__` dispatches `combine` through the generic
   `_build_fallback_method` when the fallback option is enabled.
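
The dispatch pattern described above can be sketched with a toy class. This is a simplified illustration only, not the pandas-on-Spark code: `ToyFrame`, `ToyNotImplementedError`, the `UNSUPPORTED` set, and the `fallback_enabled` flag are all hypothetical stand-ins, and the real `__getattr__` / `_build_fallback_method` logic may differ in detail. It assumes only that the presence of a `_<name>_fallback` method is what lets `__getattr__` route an otherwise unsupported name.

```python
import warnings


class ToyNotImplementedError(NotImplementedError):
    """Hypothetical stand-in for PandasNotImplementedError."""


class ToyFrame:
    # Names with no native implementation in this toy (hypothetical list).
    UNSUPPORTED = frozenset({"combine", "to_feather"})

    def __init__(self, rows, fallback_enabled=False):
        self.rows = list(rows)
        # Stand-in for the `compute.pandas_fallback` option.
        self.fallback_enabled = fallback_enabled

    def _combine_fallback(self, other, func):
        # Fallback hook: warn, then compute with plain Python.
        warnings.warn("combine is executed in fallback mode", UserWarning, stacklevel=2)
        return [func(a, b) for a, b in zip(self.rows, other.rows)]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_") or name not in self.UNSUPPORTED:
            raise AttributeError(name)
        hook = getattr(self, f"_{name}_fallback", None)
        if self.fallback_enabled and hook is not None:
            return hook
        raise ToyNotImplementedError(f"{name} is not implemented")


left = ToyFrame([1, 5], fallback_enabled=True)
right = ToyFrame([3, 2])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    combined = left.combine(right, min)
print(combined, [str(w.message) for w in caught])
```

In this toy, `to_feather` has no `_to_feather_fallback` hook, so it still raises even with the flag on; adding a hook is what opts a name into the fallback path.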
   
   It also adds tests covering both the disabled behavior (raises
   `PandasNotImplementedError`) and the fallback-enabled behavior, plus a
   Spark Connect parity test, and registers them in
   `dev/sparktestsupport/modules.py`.
   
   JIRA: https://issues.apache.org/jira/browse/SPARK-57294
   
   ### Why are the changes needed?
   
   `combine` is a useful pandas DataFrame API that was unsupported on
   pandas-on-Spark even when users opted into `compute.pandas_fallback`.
   It is a sound fallback candidate for the same reasons as the existing
   `asof` / `set_axis` fallbacks: its result is an ordinary single-level-index
   DataFrame whose column dtypes (for example int64) map cleanly onto Spark
   types, so the generic fallback round-trip through
   `ps.from_pandas` / `as_spark_type` succeeds. Wiring it through fallback
   closes a gap in the pandas-on-Spark fallback coverage and gives users an
   explicit, opt-in way to run `combine`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. With `compute.pandas_fallback` enabled, calling
   `DataFrame.combine` on a pandas-on-Spark DataFrame now executes via the
   pandas fallback path and returns a result instead of raising
   `PandasNotImplementedError`. A `PandasAPIOnSparkAdviceWarning` is emitted
   to indicate the call ran in fallback mode. When the option is disabled
   (the default), the behavior is unchanged and `PandasNotImplementedError`
   is still raised.
   
   ### How was this patch tested?
   
   Added `pyspark.pandas.tests.frame.test_combine` and the Spark Connect
   parity test `pyspark.pandas.tests.connect.frame.test_parity_combine`,
   both registered in `dev/sparktestsupport/modules.py`. The classic test
   covers two cases:
   
   - `test_disabled`: without `compute.pandas_fallback`, `combine` raises
     `PandasNotImplementedError`.
   - `test_fallback`: with the option enabled, `combine` (including the
     `overwrite=False` case) produces results equal to pandas, asserted
     with `assert_eq` (values and dtypes).
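
For reference, the pandas behavior that `test_fallback` compares against can be reproduced in plain pandas, without Spark. The snippet below is a minimal sketch: the frames and the `take_smaller` helper are illustrative choices, not the data used in the PR's tests. It shows the default call and the effect of `overwrite=False` when the other frame lacks a column.

```python
import pandas as pd


def take_smaller(s1: pd.Series, s2: pd.Series) -> pd.Series:
    """Column-wise chooser: keep s1 only if its sum is strictly smaller."""
    return s1 if s1.sum() < s2.sum() else s2


df1 = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

# Both frames have columns A and B, so each column pair goes through the function.
both = df1.combine(df2, take_smaller)

# df3 has no column A. With the default overwrite=True, A is combined with an
# all-NaN column and comes back as NaN; with overwrite=False, A is preserved.
df3 = pd.DataFrame({"B": [3, 3]})
overwritten = df1.combine(df3, take_smaller)
preserved = df1.combine(df3, take_smaller, overwrite=False)

print(both)
print(overwritten)
print(preserved)
```

With `compute.pandas_fallback` enabled, the pandas-on-Spark call is expected to match these pandas results, which is what the test asserts.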
   
   Ran `test_combine` against a real local SparkSession:
   
   ```
   $ python -m pytest python/pyspark/pandas/tests/frame/test_combine.py -v
   collected 4 items
   ... test_assert_classic_mode PASSED
   ... CombineTests::test_assert_classic_mode PASSED
   ... CombineTests::test_disabled PASSED
   ... CombineTests::test_fallback PASSED
   4 passed in 11.32s
   ```
   
   Environment: PySpark master (based on commit c082f824), pandas 2.2.3,
   PyArrow as bundled, OpenJDK 17.0.18, Python 3.11. The
   `PandasAPIOnSparkAdviceWarning: combine is executed in fallback mode`
   message confirms the call exercised the fallback path.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude (Anthropic) Opus 4.8
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

