HeartSaVioR opened a new pull request, #52479:
URL: https://github.com/apache/spark/pull/52479
### What changes were proposed in this pull request?
This PR proposes removing the use of `fetchWithArrow` in
`ListState.put` and `ListState.appendList`.
### Why are the changes needed?
We have observed a case where the Arrow path for sending the list fails,
while the normal path does not.
The failure occurs when an element of the list state has a `None` value in an
`IntegerType()` column. The column is declared with nullable=True, so `None`
should be allowed, but an error is raised during the conversion:
```
  File "/databricks/spark/python/pyspark/sql/streaming/stateful_processor.py", line 147, in put
    self._listStateClient.put(self._stateName, newState)
  File "/databricks/spark/python/pyspark/sql/streaming/list_state_client.py", line 195, in put
    self._stateful_processor_api_client._send_arrow_state(self.schema, values)
  File "/spark/python/pyspark/sql/streaming/stateful_processor_api_client.py", line 604, in _send_arrow_state
    pandas_df = convert_pandas_using_numpy_type(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/spark/python/pyspark/sql/pandas/types.py", line 1599, in convert_pandas_using_numpy_type
    df[field.name] = df[field.name].astype(np_type)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
    return self.apply(
           ^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
    applied = getattr(b, f)(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
    values = _astype_nansafe(values, dtype, copy=copy)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
    return arr.astype(dtype, copy=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
```
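For reference, the failing cast at the bottom of the traceback can be approximated outside Spark. The snippet below is only a sketch, not the Spark code path: the helper name `cast_error` and the column name `value` are made up for illustration, and it assumes the column reaches `astype` as an object-dtype column holding a Python `None`, which is what the `int()` error message suggests.

```python
import numpy as np
import pandas as pd


def cast_error(values, np_type=np.int32):
    """Cast a one-column DataFrame to np_type; return the exception, or None on success.

    Hypothetical helper for illustration only. It mimics the
    `df[field.name].astype(np_type)` step seen in the traceback.
    """
    # dtype=object keeps None as a Python object instead of letting pandas
    # infer float64 with NaN (an assumption about how the data arrives).
    df = pd.DataFrame({"value": pd.Series(values, dtype=object)})
    try:
        df["value"] = df["value"].astype(np_type)
    except (TypeError, ValueError) as exc:
        return exc
    return None


# A list element with None in an integer column fails the cast.
print(repr(cast_error([1, None, 3])))

# The same cast succeeds when no element is None.
print(repr(cast_error([1, 2, 3])))
```

A plain NumPy integer dtype has no representation for a missing value, so the cast cannot succeed for a nullable column that actually contains `None`, regardless of what the schema allows.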
Since we don't know how useful the Arrow-based path for sending the list is,
it is better not to try to fix the issue in the Arrow code path at this point
and to simply remove that path.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated the existing test to cover the observed case.
### Was this patch authored or co-authored using generative AI tooling?
No.