Yicong Huang created SPARK-55056:
------------------------------------
Summary: toPandas() crashes with SIGSEGV on nested empty arrays
Key: SPARK-55056
URL: https://issues.apache.org/jira/browse/SPARK-55056
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
{{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array
types (depth >= 3) with an empty outer array.
{code:python}
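from pyspark.sql import Row
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Three levels of array nesting: data -> arr -> inner
schema = StructType([
    StructField("data", ArrayType(StructType([
        StructField("arr", ArrayType(StructType([
            StructField("inner", ArrayType(StringType()))
        ])))
    ])))
])
# Assumes an active SparkSession named `spark`
df = spark.createDataFrame([Row(data=[])], schema=schema)
df.toPandas()  # SIGSEGV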
{code}
The Arrow format requires a ListArray offset buffer to have N+1 entries. Even when
N=0, the buffer must contain {{\[0\]}}. When the outer array is empty, the nested
{{ArrayWriter}}s are never invoked, so their {{count}} stays 0. As a result,
{{getBufferSizeFor(0)}} returns 0 and the offset buffer is omitted during IPC
serialization, which violates the Arrow spec.
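To illustrate the N+1 rule, here is a minimal pure-Python sketch of how list offsets are laid out. It does not use Arrow or Spark; {{build_offsets}} and {{offset_buffer_bytes}} are hypothetical helper names, and the 32-bit little-endian packing assumes the regular (non-large) List layout.

```python
import struct


def build_offsets(values):
    """Return Arrow-style list offsets: N+1 entries for N lists."""
    offsets = [0]  # the leading 0 is present even when there are no lists
    for v in values:
        offsets.append(offsets[-1] + len(v))
    return offsets


def offset_buffer_bytes(offsets):
    """Pack offsets as little-endian int32 values (hypothetical helper)."""
    return struct.pack(f"<{len(offsets)}i", *offsets)


# N = 0: the offsets are still [0], so the buffer is 4 bytes, not 0.
empty = build_offsets([])
assert empty == [0]
assert len(offset_buffer_bytes(empty)) == 4

# N = 3: four offsets delimit three lists of lengths 1, 0 and 2.
three = build_offsets([["a"], [], ["b", "c"]])
assert three == [0, 1, 1, 3]
```

A writer that sizes the offset buffer purely from its element count would compute 0 bytes in the first case, which is the situation described above.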
{code:scala}
// ArrayWriter.scala - current behavior
override def setValue(...): Unit = {
  while (i < array.numElements()) { // never runs when empty
    elementWriter.write(array, i)   // nested writer never called
  }
}
{code}
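The propagation problem can be sketched with a toy Python model of nested writers. This is not Spark's code: {{ToyLeafWriter}} and {{ToyArrayWriter}} are hypothetical classes, and the intermediate struct layers from the schema are omitted for brevity.

```python
class ToyLeafWriter:
    """Stands in for the innermost (string) writer."""

    def __init__(self):
        self.count = 0

    def write(self, value):
        self.count += 1


class ToyArrayWriter:
    """Stands in for an array writer that delegates to an element writer."""

    def __init__(self, element_writer):
        self.element_writer = element_writer
        self.count = 0

    def write(self, array):
        # With an empty array this loop body never runs, so the
        # element writer (and everything below it) is never touched.
        for element in array:
            self.element_writer.write(element)
        self.count += 1


inner = ToyArrayWriter(ToyLeafWriter())  # models "inner"
middle = ToyArrayWriter(inner)           # models "arr"
outer = ToyArrayWriter(middle)           # models "data"

outer.write([])  # one row whose outer array is empty

counts = (outer.count, middle.count, inner.count)
assert counts == (1, 0, 0)
```

Only the outer writer records a value; both nested writers keep a count of 0, matching the state in which their offset buffers end up being sized as 0 bytes.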