itholic opened a new pull request, #40525:
URL: https://github.com/apache/spark/pull/40525
### What changes were proposed in this pull request?
This PR proposes supporting the pandas API on Spark in Spark Connect. It
includes the minimal changes needed for basic pandas API functionality in
Spark Connect, and sets up a testing environment in
`pyspark/pandas/tests/connect` that reuses all existing pandas API on Spark
test bases to exercise the pandas API on Spark in a remote Spark session.
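The idea of reusing existing test bases against a second backend can be sketched with plain `unittest`. This is only an illustration under stated assumptions: `FrameTestsMixin`, `RegularSessionTests`, `ParityTests`, `make_frame`, and `run_suite` are hypothetical names, and the dict-based "frames" are stand-ins so the sketch runs without Spark. It is not the code in this PR.

```python
import unittest


class FrameTestsMixin:
    """Backend-agnostic tests; concrete classes supply `make_frame`."""

    def make_frame(self, data):
        raise NotImplementedError

    def test_column_sum(self):
        frame = self.make_frame({"a": [1, 2, 3]})
        self.assertEqual(sum(frame["a"]), 6)

    def test_row_count(self):
        frame = self.make_frame({"a": [1, 2, 3]})
        self.assertEqual(len(frame["a"]), 3)


class RegularSessionTests(FrameTestsMixin, unittest.TestCase):
    # Stand-in for a suite that runs against a regular session.
    def make_frame(self, data):
        return dict(data)


class ParityTests(FrameTestsMixin, unittest.TestCase):
    # Stand-in for the same suite run against a remote session.
    def make_frame(self, data):
        return {name: tuple(values) for name, values in data.items()}

    # Hypothetical: a case that is not supported remotely yet is skipped
    # instead of being deleted, so it can be re-enabled later.
    @unittest.skip("hypothetical: not supported in the remote backend yet")
    def test_row_count(self):
        super().test_row_count()


def run_suite(case):
    """Run every test of one TestCase class and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With this layout each test body is written once, and the gap between the two backends shows up as the list of skipped cases in the parity class.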
### Why are the changes needed?
Supporting the pandas API in Spark Connect can significantly improve
usability for existing PySpark and pandas users.
### Does this PR introduce _any_ user-facing change?
Existing code written for regular Spark sessions is intended to work without
any user-facing changes. However, some features of the existing pandas API on
Spark are not yet fully supported in Spark Connect, so some functionality may
be limited.
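As a rough illustration of what "existing code" could look like when pointed at a remote session, here is a minimal sketch. It assumes a Spark Connect server is reachable at the given URL and that the pandas API on Spark uses the session created beforehand; `connect_url` and `column_total` are hypothetical helper names, and whether a given operation works remotely depends on the feature support described above.

```python
def connect_url(host="localhost", port=15002):
    """Build a Spark Connect URL; 15002 is the default server port."""
    return f"sc://{host}:{port}"


def column_total(url):
    """Sum a small column through the pandas API on a remote session.

    Assumption: a Spark Connect server is listening at `url`.
    """
    # Imports are deferred so this module loads without PySpark installed.
    from pyspark.sql import SparkSession
    import pyspark.pandas as ps

    # Create the remote session first; the pandas-style code below is the
    # same code one would write for a regular session.
    SparkSession.builder.remote(url).getOrCreate()
    psdf = ps.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})
    return int(psdf["value"].sum())
```

Calling `column_total(connect_url())` would only differ from a regular-session script in the line that creates the session.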
### How was this patch tested?
A test bed for Spark Connect has been set up that reproduces all existing
pandas API on Spark tests in Spark Connect. The current results for all tests
are as follows:
| Test file | Total tests | Tests passed | Coverage |
| --------------------------------------------------- | ----------- | ------------ | -------- |
| test_parity_dataframe.py | 105 | 85 | 80.95% |
| test_parity_dataframe_slow.py | 66 | 48 | 72.73% |
| test_parity_dataframe_conversion.py | 11 | 11 | 100.00% |
| test_parity_dataframe_spark_io.py | 8 | 7 | 87.50% |
| test_parity_ops_on_diff_frames.py | 75 | 75 | 100.00% |
| test_parity_series.py | 131 | 104 | 79.39% |
| test_parity_series_datetime.py | 41 | 34 | 82.93% |
| test_parity_categorical.py | 29 | 22 | 75.86% |
| test_parity_config.py | 7 | 7 | 100.00% |
| test_parity_csv.py | 18 | 18 | 100.00% |
| test_parity_default_index.py | 4 | 1 | 25.00% |
| test_parity_ewm.py | 3 | 1 | 33.33% |
| test_parity_expanding.py | 22 | 2 | 9.09% |
| test_parity_extention.py | 7 | 7 | 100.00% |
| test_parity_frame_spark.py | 6 | 2 | 33.33% |
| test_parity_generic_functions.py | 4 | 1 | 25.00% |
| test_parity_groupby.py | 49 | 36 | 73.47% |
| test_parity_groupby_slow.py | 205 | 147 | 71.71% |
| test_parity_indexing.py | 3 | 3 | 100.00% |
| test_parity_indexops_spark.py | 3 | 3 | 100.00% |
| test_parity_internal.py | 1 | 0 | 0.00% |
| test_parity_namespace.py | 29 | 26 | 89.66% |
| test_parity_numpy_compat.py | 6 | 4 | 66.67% |
| test_parity_ops_on_diff_frames_groupby.py | 22 | 13 | 59.09% |
| test_parity_ops_on_diff_frames_groupby_expanding.py | 7 | 0 | 0.00% |
| test_parity_ops_on_diff_frames_groupby_rolling.py | 7 | 0 | 0.00% |
| test_parity_ops_on_diff_frames_slow.py | 22 | 15 | 68.18% |
| test_parity_repr.py | 5 | 5 | 100.00% |
| test_parity_resample.py | 5 | 3 | 60.00% |
| test_parity_reshape.py | 10 | 8 | 80.00% |
| test_parity_rolling.py | 21 | 1 | 4.76% |
| test_parity_scalars.py | 1 | 1 | 100.00% |
| test_parity_series_conversion.py | 2 | 2 | 100.00% |
| test_parity_series_string.py | 56 | 55 | 98.21% |
| test_parity_spark_functions.py | 1 | 1 | 100.00% |
| test_parity_sql.py | 7 | 4 | 57.14% |
| test_parity_stats.py | 15 | 7 | 46.67% |
| test_parity_typedef.py | 10 | 10 | 100.00% |
| test_parity_utils.py | 5 | 5 | 100.00% |
| test_parity_window.py | 2 | 2 | 100.00% |
| test_parity_frame_plot.py | 7 | 5 | 71.43% |
| plot/test_parity_frame_plot_matplotlib.py | 13 | 11 | 84.62% |
| plot/test_parity_frame_plot_plotly.py | 12 | 9 | 75.00% |
| plot/test_parity_series_plot.py | 3 | 3 | 100.00% |
| plot/test_parity_series_plot_matplotlib.py | 14 | 8 | 57.14% |
| plot/test_parity_series_plot_plotly.py | 9 | 7 | 77.78% |
| indexes/test_parity_base.py | 144 | 75 | 52.08% |
| indexes/test_parity_category.py | 16 | 7 | 43.75% |
| indexes/test_parity_datetime.py | 13 | 11 | 84.62% |
| indexes/test_parity_timedelta.py | 2 | 1 | 50.00% |
| data_type_ops/test_parity_base.py | 2 | 2 | 100.00% |
| data_type_ops/test_parity_binary_ops.py | 30 | 25 | 83.33% |
| data_type_ops/test_parity_boolean_ops.py | 31 | 26 | 83.87% |
| data_type_ops/test_parity_categorical_ops.py | 30 | 23 | 76.67% |
| data_type_ops/test_parity_complex_ops.py | 30 | 30 | 100.00% |
| data_type_ops/test_parity_date_ops.py | 30 | 25 | 83.33% |
| Total | 1417 | 1044 | 73.68% |
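The coverage column appears to be the share of tests that pass, that is, passed divided by total. A small sketch of that arithmetic, using three rows from the table (the helper name `pass_rate` is an assumption for illustration):

```python
def pass_rate(passed, total):
    """Return passed/total as a percentage rounded to two decimals."""
    return round(100 * passed / total, 2)


# (total, passed) pairs copied from the table above.
rows = {
    "test_parity_dataframe.py": (105, 85),
    "test_parity_expanding.py": (22, 2),
    "Total": (1417, 1044),
}

for name, (total, passed) in rows.items():
    print(f"{name}: {pass_rate(passed, total):.2f}%")
```

For these rows the computed values match the table: 80.95%, 9.09%, and 73.68%.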