[ 
https://issues.apache.org/jira/browse/SPARK-51112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-51112:
---------------------------------------------
    Description: 
Run the following code with a running local connect server:
{code:python}
import faulthandler

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ArrayType, IntegerType, StringType, StructField, StructType,
)

faulthandler.enable()
spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()
sp_df = spark.createDataFrame(
    data = [],
    schema=StructType(
        [
            StructField(
                name='b_int',
                dataType=IntegerType(),
                nullable=False,
            ),
            StructField(
                name='b',
                dataType=ArrayType(ArrayType(StringType(), True), True),
                nullable=True,
            ),
        ]
    )
)
print(sp_df)
print('Spark dataframe generated.')
print(sp_df.toPandas())
print('Pandas dataframe generated.')
{code}
When {{sp_df.toPandas()}} is called, a segmentation fault may occur. The seg fault is non-deterministic: it does not happen on every run.
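Because the crash is intermittent, repeating the conversion in a loop makes it easier to reproduce (a rough sketch, assuming the {{spark}} session and {{sp_df}} from the snippet above; {{faulthandler}} prints the native stack trace when the process crashes):
{code:python}
# Rough repro loop (assumption: repeating the Arrow-to-pandas conversion makes
# the intermittent seg fault more likely to surface within a single run).
for attempt in range(50):
    print(f'attempt {attempt}')
    sp_df.toPandas()
{code}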

Observations:
 * When I added some sample data, the issue went away and the conversion was successful.

 * When I changed {{ArrayType(ArrayType(StringType(), True), True)}} to {{ArrayType(StringType(), True)}}, there was no seg fault and execution was successful *regardless of data.*

 * When I converted the nested array column into a JSON field using {{to_json}} (and dropped the original nested array column), there was again no seg fault, and execution was successful *regardless of data* (a sketch of this workaround follows the list).
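
For reference, the {{to_json}} workaround from the last observation looks roughly like this (a sketch, assuming the {{sp_df}} defined above; {{b_json}} is just an illustrative column name):
{code:python}
from pyspark.sql.functions import to_json

# Workaround sketch: serialize the nested array column to a JSON string and
# drop the original column, so the collected result no longer contains a
# nested list type.
flat_df = sp_df.withColumn('b_json', to_json(sp_df.b)).drop('b')
print(flat_df.toPandas())
{code}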

 

Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.
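To help narrow this down, the conversion can be attempted directly in pyarrow without Spark (a minimal sketch under the assumption that the crash lives in the Arrow-to-pandas step; {{split_blocks}}/{{self_destruct}} mirror options PySpark's Arrow path can enable when the corresponding configs are set):
{code:python}
import pyarrow as pa

# Build an empty Arrow table with the same schema as the repro above and
# convert it to pandas, to check whether the crash reproduces outside Spark.
schema = pa.schema([
    pa.field('b_int', pa.int32(), nullable=False),
    pa.field('b', pa.list_(pa.list_(pa.string())), nullable=True),
])
empty_table = schema.empty_table()
# split_blocks / self_destruct are optional Arrow-to-pandas knobs; they are
# included here only to stay close to Spark's conversion path.
print(empty_table.to_pandas(split_blocks=True, self_destruct=True))
{code}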

> [Connect] Seg fault when converting empty dataframe with nested array columns 
> to pandas
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-51112
>                 URL: https://issues.apache.org/jira/browse/SPARK-51112
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, PySpark
>    Affects Versions: 4.0.0, 4.1.0
>            Reporter: Venkata Sai Akhil Gudesa
>            Priority: Major
>




---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to