Venkata Sai Akhil Gudesa created SPARK-51112:
------------------------------------------------

             Summary: [Connect] Seg fault when converting empty dataframe with 
nested array columns to pandas
                 Key: SPARK-51112
                 URL: https://issues.apache.org/jira/browse/SPARK-51112
             Project: Spark
          Issue Type: Bug
          Components: Connect, PySpark
    Affects Versions: 4.0.0, 4.1.0
            Reporter: Venkata Sai Akhil Gudesa


Run the following code with a running local connect server:

```

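import faulthandler

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

# Dump a Python traceback if the interpreter crashes.
faulthandler.enable()

# Connect to the local Spark Connect server.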
spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()
sp_df = spark.createDataFrame(
    data = [],
    schema=StructType(
        [
            StructField(
                name='b_int',
                dataType=IntegerType(),
                nullable=False,
            ),
            StructField(
                name='b',
                dataType=ArrayType(ArrayType(StringType(), True), True),
                nullable=True,
            ),
        ]
    )
)
print(sp_df)
print('Spark dataframe generated.')
print(sp_df.toPandas())
print('Pandas dataframe generated.')

```

When `sp_df.toPandas()` is called, a segmentation fault may occur. The seg 
fault is non-deterministic and does not occur on every run.
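
Because the crash is intermittent, it can help to run the reproduction several times and count how many runs die with SIGSEGV. The following is a minimal sketch, not part of the original report: it assumes a POSIX system and that the script above is saved as `repro.py` (a hypothetical file name). `count_segfaults` is an illustrative helper, not a Spark API.

```python
import signal
import subprocess
import sys


def count_segfaults(args, runs=20):
    """Run the current Python interpreter with `args` `runs` times.

    Returns how many of those runs were terminated by SIGSEGV.
    On POSIX, subprocess reports death-by-signal as a negative return code.
    """
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, *args],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes


# Example (assumes the reproduction is saved as repro.py):
#   crashes = count_segfaults(["repro.py"], runs=20)
#   print(f"{crashes}/20 runs ended with SIGSEGV")
```

A run that finishes normally or fails with an ordinary Python exception is not counted, so the result isolates the seg fault from other failures such as an unreachable connect server.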

Observations:
 * When I added some sample data, the issue went away and the conversion was 
successful.

 * When I changed {{ArrayType(ArrayType(StringType(), True), True)}} to 
{{ArrayType(StringType(), True)}}, there was no seg fault and execution was 
successful *regardless of data.*

 * When I converted the nested array column into a JSON field using {{to_json}} 
(and dropped the original nested array column), there was again no seg fault, 
and execution was successful *regardless of data.*
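
The {{to_json}} observation can be turned into a workaround. The sketch below is illustrative only and makes some assumptions: it inspects top-level columns only, it relies on the dict produced by `StructType.jsonValue()`, and the helper names `is_nested_array`, `nested_array_columns` and `stringify_nested_arrays` are hypothetical, not Spark APIs. Replacing a column in place with `withColumn` has the same effect as adding a JSON column and dropping the original.

```python
def is_nested_array(data_type):
    """True if a schema JSON type is an array whose elements are arrays."""
    if not isinstance(data_type, dict) or data_type.get("type") != "array":
        return False
    element = data_type.get("elementType")
    return isinstance(element, dict) and element.get("type") == "array"


def nested_array_columns(schema_json):
    """Names of top-level fields of type array<array<...>>.

    `schema_json` is the dict returned by `df.schema.jsonValue()`.
    """
    return [
        field["name"]
        for field in schema_json["fields"]
        if is_nested_array(field["type"])
    ]


def stringify_nested_arrays(df):
    """Replace every nested array column with its JSON string form."""
    # Imported lazily so the schema helpers work without pyspark installed.
    from pyspark.sql import functions as F

    for name in nested_array_columns(df.schema.jsonValue()):
        df = df.withColumn(name, F.to_json(F.col(name)))
    return df


# Example (assumes sp_df from the reproduction above):
#   pdf = stringify_nested_arrays(sp_df).toPandas()
```

The resulting pandas column holds JSON strings, so callers that need lists back would have to parse them, for example with `json.loads`.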

 

Conclusion: There is an issue in pyarrow/pandas that is triggered when 
converting empty datasets containing nested array columns.


