Yicong Huang created SPARK-55059:
------------------------------------
Summary: Remove empty table workaround in toPandas
Key: SPARK-55059
URL: https://issues.apache.org/jira/browse/SPARK-55059
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
SPARK-51112 added a workaround in {{_convert_arrow_table_to_pandas()}} to
avoid a segfault when converting empty tables with nested array columns:
{code:python}
# SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
# DataFrame, as it may fail with a segmentation fault.
if arrow_table.num_rows == 0:
    column_data = (
        pd.Series([], name=temp_col_names[i], dtype="object") for i in range(len(schema.fields))
    )
{code}
This workaround is no longer necessary after SPARK-55056, which fixed the root
cause in {{ArrayWriter.finish()}} by properly initializing the Arrow ListArray
offset buffer when {{count == 0}}.
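For context on why the {{count == 0}} case matters: the Arrow columnar format describes a list array's offsets buffer as holding length + 1 entries, so an empty list column is still expected to carry a single offset of 0. The pure-Python sketch below only illustrates that invariant. It is not the code in {{ArrayWriter.finish()}}, and the helper names {{build_list_offsets}} and {{row_slice}} are hypothetical.

```python
from typing import List, Sequence


def build_list_offsets(rows: Sequence[Sequence[int]]) -> List[int]:
    """Return Arrow-style list offsets: len(rows) + 1 entries, starting at 0."""
    offsets = [0]  # the leading 0 is written even when there are no rows
    for row in rows:
        # Each entry is the running total of child values seen so far.
        offsets.append(offsets[-1] + len(row))
    return offsets


def row_slice(offsets: Sequence[int], values: Sequence[int], i: int) -> List[int]:
    """Read row i back out of the flattened child values using the offsets."""
    return list(values[offsets[i]:offsets[i + 1]])


rows = [[1, 2], [], [3]]
values = [v for row in rows for v in row]  # flattened child values: [1, 2, 3]
offsets = build_list_offsets(rows)

print(offsets)                    # [0, 2, 2, 3]
print(row_slice(offsets, values, 0))  # [1, 2]
print(build_list_offsets([]))     # [0]
```

A reader that assumes the length + 1 layout will access the first offset even for zero rows, which is presumably why an uninitialized buffer in the empty case could lead to a crash downstream.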
Proposal: Remove the SPARK-51112 workaround and let pyarrow handle empty tables
directly.