[
https://issues.apache.org/jira/browse/SPARK-54068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043381#comment-18043381
]
Ashrith Bandla edited comment on SPARK-54068 at 12/7/25 6:47 PM:
-----------------------------------------------------------------
Hey [~dongjoon], I came across the logic that skips some tests in
test_feather.py when the PyArrow version is too high, since they would
otherwise fail. With my fix the tests now pass: I resolved the issue by
converting some of the metrics to plain dictionaries before storing them in
DataFrame attrs, so the attrs stay JSON-serializable. The change is backwards
compatible with previous versions of PyArrow as well. I opened a short PR
here: [https://github.com/apache/spark/pull/53377].
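For illustration only, the failure mode and the shape of the fix can be
sketched with a stand-in class (the class and field names below are
hypothetical, not the actual Spark Connect `PlanMetrics` API; the real change
is in the PR):

```python
import json

# Illustrative stand-in for the Spark Connect PlanMetrics object; the
# attribute names here are assumptions, not the real class's fields.
class PlanMetrics:
    def __init__(self, name, metrics):
        self.name = name
        self.metrics = metrics

raw = PlanMetrics("project", {"numOutputRows": 10})

# json.dumps (invoked by pyarrow's construct_metadata when writing
# DataFrame.attrs into the Feather pandas metadata) cannot serialize an
# arbitrary Python object, which is the TypeError in the traceback below.
try:
    json.dumps({"attributes": {"metrics": [raw]}})
except TypeError as exc:
    print(exc)

# Converting the metrics to plain dictionaries before storing them in
# DataFrame.attrs makes the whole payload JSON-serializable.
as_dict = {"name": raw.name, "metrics": raw.metrics}
encoded = json.dumps({"attributes": {"metrics": [as_dict]}})
print(encoded)
```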
> Fix `pyspark.pandas.tests.connect.io.test_parity_feather.FeatherParityTests.test_to_feather` in PyArrow 22.0.0
> --------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-54068
> URL: https://issues.apache.org/jira/browse/SPARK-54068
> Project: Spark
> Issue Type: Sub-task
> Components: Connect, PySpark
> Affects Versions: 4.1.0
> Reporter: Dongjoon Hyun
> Priority: Blocker
> Labels: pull-request-available
>
> {code}
> ======================================================================
> ERROR [1.960s]: test_to_feather (pyspark.pandas.tests.connect.io.test_parity_feather.FeatherParityTests.test_to_feather)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/pandas/tests/io/test_feather.py", line 43, in test_to_feather
>     self.psdf.to_feather(path2)
>     ~~~~~~~~~~~~~~~~~~~~^^^^^^^
>   File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 2702, in to_feather
>     return validate_arguments_and_invoke_function(
>         self._to_internal_pandas(), self.to_feather, pd.DataFrame.to_feather, args
>     )
>   File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 592, in validate_arguments_and_invoke_function
>     return pandas_func(**args)
>   File "/usr/local/lib/python3.14/dist-packages/pandas/core/frame.py", line 2949, in to_feather
>     to_feather(self, path, **kwargs)
>     ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.14/dist-packages/pandas/io/feather_format.py", line 65, in to_feather
>     feather.write_feather(df, handles.handle, **kwargs)
>     ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/usr/local/lib/python3.14/dist-packages/pyarrow/feather.py", line 156, in write_feather
>     table = Table.from_pandas(df, preserve_index=preserve_index)
>   File "pyarrow/table.pxi", line 4795, in pyarrow.lib.Table.from_pandas
>   File "/usr/local/lib/python3.14/dist-packages/pyarrow/pandas_compat.py", line 663, in dataframe_to_arrays
>     pandas_metadata = construct_metadata(
>         columns_to_convert, df, column_names, index_columns, index_descriptors,
>         preserve_index, types, column_field_names=column_field_names
>     )
>   File "/usr/local/lib/python3.14/dist-packages/pyarrow/pandas_compat.py", line 281, in construct_metadata
>     b'pandas': json.dumps({
>     ~~~~~~~~~~^^
>         'index_columns': index_descriptors,
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     ...<7 lines>...
>         'pandas_version': _pandas_api.version
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     }).encode('utf8')
>     ^^
>   File "/usr/lib/python3.14/json/__init__.py", line 231, in dumps
>     return _default_encoder.encode(obj)
>            ~~~~~~~~~~~~~~~~~~~~~~~^^^^^
>   File "/usr/lib/python3.14/json/encoder.py", line 200, in encode
>     chunks = self.iterencode(o, _one_shot=True)
>   File "/usr/lib/python3.14/json/encoder.py", line 261, in iterencode
>     return _iterencode(o, 0)
>   File "/usr/lib/python3.14/json/encoder.py", line 180, in default
>     raise TypeError(f'Object of type {o.__class__.__name__} '
>                     f'is not JSON serializable')
> TypeError: Object of type PlanMetrics is not JSON serializable
> when serializing list item 0
> when serializing dict item 'metrics'
> when serializing dict item 'attributes'
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]