This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new b10fea96b5b [SPARK-45566][PS] Support Pandas-like testing utils for
Pandas API on Spark
b10fea96b5b is described below
commit b10fea96b5b0fd6c3623b0463d17dc583de3e995
Author: Haejoon Lee <[email protected]>
AuthorDate: Wed Oct 18 06:59:24 2023 +0900
[SPARK-45566][PS] Support Pandas-like testing utils for Pandas API on Spark
### What changes were proposed in this pull request?
This PR proposes to support utility functions `assert_frame_equal`,
`assert_series_equal`, and `assert_index_equal` in the Pandas API on Spark to
aid users in testing.
See
[pd.assert_frame_equal](https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_frame_equal.html),
[pd.assert_series_equal](https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_series_equal.html),
[pd.assert_index_equal](https://pandas.pydata.org/docs/reference/api/pandas.testing.assert_index_equal.html)
for more detail.
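Judging from the diff further down, each helper converts any pandas-on-Spark input with `to_pandas()` and then delegates to the matching `pandas.testing` function. Below is a minimal sketch of that pattern that runs without Spark. `FakeSparkFrame` and `assert_frame_equal_sketch` are hypothetical names used only for illustration, and the duck-typed `hasattr` check is an assumption standing in for the `isinstance(left, ps.DataFrame)` check in the actual helper.

```python
import pandas as pd


class FakeSparkFrame:
    """Hypothetical stand-in for ps.DataFrame; only to_pandas() is modeled."""

    def __init__(self, data):
        self._pdf = pd.DataFrame(data)

    def to_pandas(self):
        return self._pdf.copy()


def assert_frame_equal_sketch(left, right, **kwargs):
    # Assumption: duck-typing on to_pandas() replaces the isinstance check
    # against ps.DataFrame that the actual helper performs.
    if hasattr(left, "to_pandas"):
        left = left.to_pandas()
    if hasattr(right, "to_pandas"):
        right = right.to_pandas()
    # All keyword arguments are forwarded unchanged to pandas.
    pd.testing.assert_frame_equal(left, right, **kwargs)


distributed = FakeSparkFrame({"a": [1, 2], "b": [3, 4]})
local = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
assert_frame_equal_sketch(distributed, local)  # mixed inputs are accepted
```

Because `to_pandas()` collects the data onto the driver, this pattern is probably best suited to small test fixtures.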
### Why are the changes needed?
These utility functions allow users to efficiently test the equality of
`DataFrames`, `Series`, and `Indexes` in the Pandas API on Spark. Ensuring
accurate testing helps in maintaining code quality and user trust in the
platform.
For example:
```python
import pyspark.pandas as ps
from pyspark.pandas.testing import assert_frame_equal
# The helpers accept pandas-on-Spark or pandas objects, not Spark SQL DataFrames.
df1 = ps.DataFrame({"name": ["Alice", "Bob"], "age": [1, 2]})
df2 = ps.DataFrame({"name": ["Alice", "Bob"], "age": [1, 2]})
assert_frame_equal(df1, df2)
```
### Does this PR introduce _any_ user-facing change?
Yes. Users now have access to the `assert_frame_equal`,
`assert_series_equal`, and `assert_index_equal` functions in
`pyspark.pandas.testing` for testing purposes.
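Since the new helpers appear to forward `check_exact`, `rtol`, and `atol` unchanged to pandas, the effect of these tolerance parameters can be tried with `pandas.testing` alone, with no Spark session. The sketch below is an illustration under that assumption. The variable names are hypothetical and the values are made up for the example.

```python
import pandas as pd

expected = pd.Series([1.0, 2.0, 3.0])
actual = pd.Series([1.0, 2.0, 3.0 + 1e-9])

# With the default (inexact) comparison, the tiny difference falls
# within the rtol/atol tolerances, so no error is raised.
pd.testing.assert_series_equal(expected, actual)

# With check_exact=True, the same difference is reported as a mismatch.
try:
    pd.testing.assert_series_equal(expected, actual, check_exact=True)
except AssertionError:
    mismatch_detected = True
else:
    mismatch_detected = False
```

The same keyword arguments should behave identically when passed through `pyspark.pandas.testing.assert_series_equal`, given that the wrapper only converts its inputs before delegating.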
### How was this patch tested?
Added doctests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43398 from itholic/SPARK-45566.
Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
.../docs/source/reference/pyspark.pandas/index.rst | 1 +
.../pyspark.pandas/{index.rst => testing.rst} | 31 +-
python/pyspark/pandas/testing.py | 328 +++++++++++++++++++++
3 files changed, 341 insertions(+), 19 deletions(-)
diff --git a/python/docs/source/reference/pyspark.pandas/index.rst b/python/docs/source/reference/pyspark.pandas/index.rst
index 31fc95e95f1..0d45ba64b4d 100644
--- a/python/docs/source/reference/pyspark.pandas/index.rst
+++ b/python/docs/source/reference/pyspark.pandas/index.rst
@@ -38,3 +38,4 @@ This page gives an overview of all public pandas API on Spark.
resampling
ml
extensions
+ testing
diff --git a/python/docs/source/reference/pyspark.pandas/index.rst b/python/docs/source/reference/pyspark.pandas/testing.rst
similarity index 69%
copy from python/docs/source/reference/pyspark.pandas/index.rst
copy to python/docs/source/reference/pyspark.pandas/testing.rst
index 31fc95e95f1..67589fb019a 100644
--- a/python/docs/source/reference/pyspark.pandas/index.rst
+++ b/python/docs/source/reference/pyspark.pandas/testing.rst
@@ -16,25 +16,18 @@
under the License.
-===================
-Pandas API on Spark
-===================
+.. _api.testing:
-This page gives an overview of all public pandas API on Spark.
+=======
+Testing
+=======
+.. currentmodule:: pyspark.pandas
-.. note::
- pandas API on Spark follows the API specifications of latest pandas release.
+Assertion functions
+-------------------
+.. autosummary::
+ :toctree: api/
-.. toctree::
- :maxdepth: 2
-
- io
- general_functions
- series
- frame
- indexing
- window
- groupby
- resampling
- ml
- extensions
+ testing.assert_frame_equal
+ testing.assert_series_equal
+ testing.assert_index_equal
diff --git a/python/pyspark/pandas/testing.py b/python/pyspark/pandas/testing.py
new file mode 100644
index 00000000000..49ec6081338
--- /dev/null
+++ b/python/pyspark/pandas/testing.py
@@ -0,0 +1,328 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Public testing utility functions.
+"""
+from typing import Literal, Union
+import pyspark.pandas as ps
+
+try:
+ from pyspark.sql.pandas.utils import require_minimum_pandas_version
+
+ require_minimum_pandas_version()
+ import pandas as pd
+except ImportError:
+ pass
+
+
+def assert_frame_equal(
+ left: Union[ps.DataFrame, pd.DataFrame],
+ right: Union[ps.DataFrame, pd.DataFrame],
+ check_dtype: bool = True,
+ check_index_type: Union[bool, Literal["equiv"]] = "equiv",
+ check_column_type: Union[bool, Literal["equiv"]] = "equiv",
+ check_frame_type: bool = True,
+ check_names: bool = True,
+ by_blocks: bool = False,
+ check_exact: bool = False,
+ check_datetimelike_compat: bool = False,
+ check_categorical: bool = True,
+ check_like: bool = False,
+ check_freq: bool = True,
+ check_flags: bool = True,
+ rtol: float = 1.0e-5,
+ atol: float = 1.0e-8,
+ obj: str = "DataFrame",
+) -> None:
+ """
+ Check that left and right DataFrame are equal.
+
+ This function is intended to compare two DataFrames and output any
+ differences. It is mostly intended for use in unit tests.
+ Additional parameters allow varying the strictness of the
+ equality checks performed.
+
+ .. versionadded:: 4.0.0
+
+ Parameters
+ ----------
+ left : DataFrame
+ First DataFrame to compare.
+ right : DataFrame
+ Second DataFrame to compare.
+ check_dtype : bool, default True
+ Whether to check the DataFrame dtype is identical.
+ check_index_type : bool or {'equiv'}, default 'equiv'
+ Whether to check the Index class, dtype and inferred_type
+ are identical.
+ check_column_type : bool or {'equiv'}, default 'equiv'
+ Whether to check the columns class, dtype and inferred_type
+ are identical. Is passed as the ``exact`` argument of
+ :func:`assert_index_equal`.
+ check_frame_type : bool, default True
+ Whether to check the DataFrame class is identical.
+ check_names : bool, default True
+ Whether to check that the `names` attribute for both the `index`
+ and `column` attributes of the DataFrame is identical.
+ by_blocks : bool, default False
+ Specify how to compare internal data. If False, compare by columns.
+ If True, compare by blocks.
+ check_exact : bool, default False
+ Whether to compare number exactly.
+ check_datetimelike_compat : bool, default False
+ Compare datetime-like which is comparable ignoring dtype.
+ check_categorical : bool, default True
+ Whether to compare internal Categorical exactly.
+ check_like : bool, default False
+ If True, ignore the order of index & columns.
+ Note: index labels must match their respective rows
+ (same as in columns) - same labels must be with the same data.
+ check_freq : bool, default True
+ Whether to check the `freq` attribute on a DatetimeIndex or TimedeltaIndex.
+ check_flags : bool, default True
+ Whether to check the `flags` attribute.
+ rtol : float, default 1e-5
+ Relative tolerance. Only used when check_exact is False.
+ atol : float, default 1e-8
+ Absolute tolerance. Only used when check_exact is False.
+ obj : str, default 'DataFrame'
+ Specify object name being compared, internally used to show appropriate
+ assertion message.
+
+ See Also
+ --------
+ assert_series_equal : Equivalent method for asserting Series equality.
+ DataFrame.equals : Check DataFrame equality.
+
+ Examples
+ --------
+ This example shows comparing two DataFrames that are equal
+ but with columns of differing dtypes.
+
+ >>> from pyspark.pandas.testing import assert_frame_equal
+ >>> df1 = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
+ >>> df2 = ps.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
+
+ df1 equals itself.
+
+ >>> assert_frame_equal(df1, df1)
+
+ df1 differs from df2 as column 'b' is of a different type.
+
+ >>> assert_frame_equal(df1, df2)
+ Traceback (most recent call last):
+ ...
+ AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="b") are different
+ <BLANKLINE>
+ Attribute "dtype" are different
+ [left]: int64
+ [right]: float64
+
+ Ignore differing dtypes in columns with check_dtype.
+
+ >>> assert_frame_equal(df1, df2, check_dtype=False)
+ """
+ if isinstance(left, ps.DataFrame):
+ left = left.to_pandas()
+ if isinstance(right, ps.DataFrame):
+ right = right.to_pandas()
+
+ pd.testing.assert_frame_equal(
+ left,
+ right,
+ check_dtype=check_dtype,
+ check_index_type=check_index_type, # type: ignore[arg-type]
+ check_column_type=check_column_type, # type: ignore[arg-type]
+ check_frame_type=check_frame_type,
+ check_names=check_names,
+ by_blocks=by_blocks,
+ check_exact=check_exact,
+ check_datetimelike_compat=check_datetimelike_compat,
+ check_categorical=check_categorical,
+ check_like=check_like,
+ check_freq=check_freq,
+ check_flags=check_flags,
+ rtol=rtol,
+ atol=atol,
+ obj=obj,
+ )
+
+
+def assert_series_equal(
+ left: Union[ps.Series, pd.Series],
+ right: Union[ps.Series, pd.Series],
+ check_dtype: bool = True,
+ check_index_type: Union[bool, Literal["equiv"]] = "equiv",
+ check_series_type: bool = True,
+ check_names: bool = True,
+ check_exact: bool = False,
+ check_datetimelike_compat: bool = False,
+ check_categorical: bool = True,
+ check_category_order: bool = True,
+ check_freq: bool = True,
+ check_flags: bool = True,
+ rtol: float = 1.0e-5,
+ atol: float = 1.0e-8,
+ obj: str = "Series",
+ *,
+ check_index: bool = True,
+ check_like: bool = False,
+) -> None:
+ """
+ Check that left and right Series are equal.
+
+ .. versionadded:: 4.0.0
+
+ Parameters
+ ----------
+ left : Series
+ right : Series
+ check_dtype : bool, default True
+ Whether to check the Series dtype is identical.
+ check_index_type : bool or {'equiv'}, default 'equiv'
+ Whether to check the Index class, dtype and inferred_type
+ are identical.
+ check_series_type : bool, default True
+ Whether to check the Series class is identical.
+ check_names : bool, default True
+ Whether to check the Series and Index names attribute.
+ check_exact : bool, default False
+ Whether to compare number exactly.
+ check_datetimelike_compat : bool, default False
+ Compare datetime-like which is comparable ignoring dtype.
+ check_categorical : bool, default True
+ Whether to compare internal Categorical exactly.
+ check_category_order : bool, default True
+ Whether to compare category order of internal Categoricals.
+ check_freq : bool, default True
+ Whether to check the `freq` attribute on a DatetimeIndex or TimedeltaIndex.
+ check_flags : bool, default True
+ Whether to check the `flags` attribute.
+ rtol : float, default 1e-5
+ Relative tolerance. Only used when check_exact is False.
+ atol : float, default 1e-8
+ Absolute tolerance. Only used when check_exact is False.
+ obj : str, default 'Series'
+ Specify object name being compared, internally used to show appropriate
+ assertion message.
+ check_index : bool, default True
+ Whether to check index equivalence. If False, then compare only values.
+ check_like : bool, default False
+ If True, ignore the order of the index. Must be False if check_index is False.
+ Note: same labels must be with the same data.
+
+ Examples
+ --------
+ >>> from pyspark.pandas import testing as tm
+ >>> a = ps.Series([1, 2, 3, 4])
+ >>> b = ps.Series([1, 2, 3, 4])
+ >>> tm.assert_series_equal(a, b)
+ """
+ if isinstance(left, ps.Series):
+ left = left.to_pandas()
+ if isinstance(right, ps.Series):
+ right = right.to_pandas()
+
+ pd.testing.assert_series_equal( # type: ignore[call-arg]
+ left,
+ right,
+ check_dtype=check_dtype,
+ check_index_type=check_index_type, # type: ignore[arg-type]
+ check_series_type=check_series_type,
+ check_names=check_names,
+ check_exact=check_exact,
+ check_datetimelike_compat=check_datetimelike_compat,
+ check_categorical=check_categorical,
+ check_category_order=check_category_order,
+ check_freq=check_freq,
+ check_flags=check_flags,
+ rtol=rtol, # type: ignore[arg-type]
+ atol=atol, # type: ignore[arg-type]
+ obj=obj,
+ check_index=check_index,
+ check_like=check_like,
+ )
+
+
+def assert_index_equal(
+ left: Union[ps.Index, pd.Index],
+ right: Union[ps.Index, pd.Index],
+ exact: Union[bool, Literal["equiv"]] = "equiv",
+ check_names: bool = True,
+ check_exact: bool = True,
+ check_categorical: bool = True,
+ check_order: bool = True,
+ rtol: float = 1.0e-5,
+ atol: float = 1.0e-8,
+ obj: str = "Index",
+) -> None:
+ """
+ Check that left and right Index are equal.
+
+ .. versionadded:: 4.0.0
+
+ Parameters
+ ----------
+ left : Index
+ right : Index
+ exact : bool or {'equiv'}, default 'equiv'
+ Whether to check the Index class, dtype and inferred_type
+ are identical. If 'equiv', then RangeIndex can be substituted for
+ Index with an int64 dtype as well.
+ check_names : bool, default True
+ Whether to check the names attribute.
+ check_exact : bool, default True
+ Whether to compare number exactly.
+ check_categorical : bool, default True
+ Whether to compare internal Categorical exactly.
+ check_order : bool, default True
+ Whether to compare the order of index entries as well as their values.
+ If True, both indexes must contain the same elements, in the same order.
+ If False, both indexes must contain the same elements, but in any order.
+ rtol : float, default 1e-5
+ Relative tolerance. Only used when check_exact is False.
+ atol : float, default 1e-8
+ Absolute tolerance. Only used when check_exact is False.
+ obj : str, default 'Index'
+ Specify object name being compared, internally used to show appropriate
+ assertion message.
+
+ Examples
+ --------
+ >>> from pyspark.pandas import testing as tm
+ >>> a = ps.Index([1, 2, 3])
+ >>> b = ps.Index([1, 2, 3])
+ >>> tm.assert_index_equal(a, b)
+ """
+ if isinstance(left, ps.Index):
+ left = left.to_pandas()
+ if isinstance(right, ps.Index):
+ right = right.to_pandas()
+
+ pd.testing.assert_index_equal( # type: ignore[call-arg]
+ left,
+ right,
+ exact=exact,
+ check_names=check_names,
+ check_exact=check_exact,
+ check_categorical=check_categorical,
+ check_order=check_order,
+ rtol=rtol,
+ atol=atol,
+ obj=obj,
+ )