[
https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ian Cook updated SPARK-48302:
-----------------------------
Description:
Because of a limitation in PyArrow, when PyArrow Tables are passed to
{{spark.createDataFrame()}}, null values in MapArray columns are replaced with
empty lists.
The PySpark function where this happens is
{{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}.
Also see [https://github.com/apache/arrow/issues/41684].
See the skipped tests and the TODO mentioning SPARK-48302.
A possible fix for this will involve adding a {{mask}} argument to
{{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which PySpark
will still support for a while) won't have this argument, we will need to do a
check like:
{{if LooseVersion(pa.__version__) >= LooseVersion("1X.0.0"):}}
or
{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}
and only passĀ {{mask}} if that's true.
was:
Because of a limitation in PyArrow, when PyArrow Tables are passed to
spark.createDataFrame(), null values in MapArray columns are replaced with
empty lists.
The PySpark function where this happens is pyspark.sql.pandas.types.
_check_arrow_array_timestamps_localize.
Also see [https://github.com/apache/arrow/issues/41684].
See the skipped tests and the TODO mentioning SPARK-48302.
> Null values in map columns of PyArrow tables are replaced with empty lists
> --------------------------------------------------------------------------
>
> Key: SPARK-48302
> URL: https://issues.apache.org/jira/browse/SPARK-48302
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.0.0
> Reporter: Ian Cook
> Priority: Major
>
> Because of a limitation in PyArrow, when PyArrow Tables are passed to
> {{spark.createDataFrame()}}, null values in MapArray columns are replaced
> with empty lists.
> The PySpark function where this happens is
> {{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}.
> Also see [https://github.com/apache/arrow/issues/41684].
> See the skipped tests and the TODO mentioning SPARK-48302.
> A possible fix for this will involve adding a {{mask}} argument to
> {{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which
> PySpark will still support for a while) won't have this argument, we will
> need to do a check like:
> {{if LooseVersion(pa.__version__) >= LooseVersion("1X.0.0"):}}
> or
> {{from inspect import signature}}
> {{"mask" in signature(pa.MapArray.from_arrays).parameters}}
> and only passĀ {{mask}} if that's true.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)