Venkata Sai Akhil Gudesa created SPARK-51112:
------------------------------------------------
Summary: [Connect] Seg fault when converting empty dataframe with nested array columns to pandas
Key: SPARK-51112
URL: https://issues.apache.org/jira/browse/SPARK-51112
Project: Spark
Issue Type: Bug
Components: Connect, PySpark
Affects Versions: 4.0.0, 4.1.0
Reporter: Venkata Sai Akhil Gudesa
Run the following code against a running local Spark Connect server:
```
import faulthandler
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, ArrayType, StringType, StructType, IntegerType

faulthandler.enable()

# Requires a running local Spark Connect server on port 15002.
spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()

sp_df = spark.createDataFrame(
    data=[],
    schema=StructType(
        [
            StructField(
                name='b_int',
                dataType=IntegerType(),
                nullable=False,
            ),
            StructField(
                name='b',
                dataType=ArrayType(ArrayType(StringType(), True), True),
                nullable=True,
            ),
        ]
    ),
)
print(sp_df)
print('Spark dataframe generated.')
print(sp_df.toPandas())
print('Pandas dataframe generated.')
```
When `sp_df.toPandas()` is called, a segmentation fault may occur. The crash is non-deterministic: it does not occur on every run.
Observations:
* When I added some sample data, the issue went away and the conversion was successful.
* When I changed {{ArrayType(ArrayType(StringType(), True), True)}} to {{ArrayType(StringType(), True)}}, there was no seg fault and execution was successful *regardless of data.*
* When I converted the nested array column into a JSON field using {{to_json}} (and dropped the original nested array column), there was again no seg fault and execution was successful *regardless of data* (see the sketch after this list).
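A minimal sketch of that workaround, reusing the {{sp_df}} from the reproduction above ({{b_json}} is a hypothetical column name chosen here for illustration):
```
from pyspark.sql.functions import to_json

# Workaround sketch: serialize the nested array column to a JSON string
# column, then drop the original nested array column before toPandas().
flat_df = sp_df.withColumn("b_json", to_json("b")).drop("b")
print(flat_df.toPandas())
```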
Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.
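If that conclusion holds, the crash should be reproducible without Spark at all. Below is an untested sketch of how one might try to isolate it at the pyarrow level; the schema mirrors the Spark one above, but this is an assumption about where the bug lives, not a confirmed reproduction:
```
import pyarrow as pa

# Build an empty Arrow table with the same shape as the Spark schema:
# a non-nullable int column and a list-of-list-of-string column.
schema = pa.schema([
    pa.field("b_int", pa.int32(), nullable=False),
    pa.field("b", pa.list_(pa.list_(pa.string()))),
])
empty_table = pa.Table.from_batches([], schema=schema)

# The suspect step: converting the empty nested-array table to pandas.
print(empty_table.to_pandas())
```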