[
https://issues.apache.org/jira/browse/SPARK-51112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Venkata Sai Akhil Gudesa updated SPARK-51112:
---------------------------------------------
Description:
Run the following code with a running local connect server:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructField,
    ArrayType,
    StringType,
    StructType,
    IntegerType,
)
import faulthandler

faulthandler.enable()

spark = SparkSession.builder \
    .remote("sc://localhost:15002") \
    .getOrCreate()

sp_df = spark.createDataFrame(
    data=[],
    schema=StructType(
        [
            StructField(
                name='b_int',
                dataType=IntegerType(),
                nullable=False,
            ),
            StructField(
                name='b',
                dataType=ArrayType(ArrayType(StringType(), True), True),
                nullable=True,
            ),
        ]
    ),
)

print(sp_df)
print('Spark dataframe generated.')
print(sp_df.toPandas())
print('Pandas dataframe generated.') {code}
When {{sp_df.toPandas()}} is called, a segmentation fault may occur. The seg fault is non-deterministic and does not occur on every run.
Observations:
* When I added some sample data, the issue went away and the conversion was successful.
* When I changed {{ArrayType(ArrayType(StringType(), True), True)}} to {{ArrayType(StringType(), True)}}, there was no seg fault and execution was successful *regardless of data.*
* When I converted the nested array column into a JSON field using {{to_json}} (and dropped the original nested array column), there was again no seg fault and execution was successful *regardless of data.*
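The {{to_json}} workaround can be sketched as follows. This is an illustration only: it uses a classic local session (rather than the Connect server at sc://localhost:15002 from the report) so it is self-contained, and the column name {{b_json}} is a hypothetical choice.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json
from pyspark.sql.types import (
    StructField, ArrayType, StringType, StructType, IntegerType,
)

# Local session for illustration; the report runs against Spark Connect.
spark = SparkSession.builder.master("local[1]").getOrCreate()

schema = StructType([
    StructField('b_int', IntegerType(), False),
    StructField('b', ArrayType(ArrayType(StringType(), True), True), True),
])
sp_df = spark.createDataFrame(data=[], schema=schema)

# Workaround: serialize the nested array column to a JSON string before
# toPandas(), then drop the original column so only flat types cross the
# Arrow conversion boundary.
flat_df = sp_df.withColumn('b_json', to_json('b')).drop('b')
pdf = flat_df.toPandas()
print(pdf.columns.tolist())
spark.stop()
{code}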
Conclusion: There is an issue in pyarrow/pandas that is triggered when converting empty datasets containing nested array columns.
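If that hypothesis is right, the crash should be reproducible without Spark at all. A minimal pyarrow-only sketch of such an isolation test (assuming, without confirmation, that the problem lives in {{Table.to_pandas}} on an empty table with a doubly-nested list column):

{code:python}
import pyarrow as pa

# Mirror the Spark schema: a non-nullable int column and a
# doubly-nested array<array<string>> column, with zero rows.
schema = pa.schema([
    pa.field("b_int", pa.int32(), nullable=False),
    pa.field("b", pa.list_(pa.list_(pa.string())), nullable=True),
])
table = pa.Table.from_arrays(
    [
        pa.array([], type=pa.int32()),
        pa.array([], type=pa.list_(pa.list_(pa.string()))),
    ],
    schema=schema,
)

# If the hypothesis holds, this is the conversion that would fault.
pdf = table.to_pandas()
print(pdf.dtypes)
{code}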
> [Connect] Seg fault when converting empty dataframe with nested array columns
> to pandas
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-51112
> URL: https://issues.apache.org/jira/browse/SPARK-51112
> Project: Spark
> Issue Type: Bug
> Components: Connect, PySpark
> Affects Versions: 4.0.0, 4.1.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]