[ https://issues.apache.org/jira/browse/SPARK-55056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-55056:
-----------------------------------
Labels: pull-request-available (was: )
> toPandas() crashes with SIGSEGV on nested empty arrays
> ------------------------------------------------------
>
> Key: SPARK-55056
> URL: https://issues.apache.org/jira/browse/SPARK-55056
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Critical
> Labels: pull-request-available
>
> {{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array
> types (depth >= 3) with an empty outer array.
> {code:python}
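> from pyspark.sql import Row
> from pyspark.sql.types import ArrayType, StringType, StructField, StructType
>
> # Three levels of array nesting; the outermost array is empty in the only row.
> schema = StructType([
>     StructField("data", ArrayType(StructType([
>         StructField("arr", ArrayType(StructType([
>             StructField("inner", ArrayType(StringType()))
>         ])))
>     ])))
> ])
> # `spark` is the active SparkSession (e.g. from the pyspark shell).
> df = spark.createDataFrame([Row(data=[])], schema=schema)
> df.toPandas()  # SIGSEGV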
> {code}
> The Arrow format requires a ListArray's offset buffer to have N+1 entries, so
> even when N=0 the buffer must contain {{[0]}}. When the outer array is empty,
> the nested {{ArrayWriter}}s are never invoked, so their {{count}} stays 0. Then
> {{getBufferSizeFor(0)}} returns 0 and the offset buffer is omitted in IPC
> serialization, which violates the Arrow spec.
> {code:scala}
> // ArrayWriter.scala - current behavior
> override def setValue(...): Unit = {
>   while (i < array.numElements()) { // never runs when empty
>     elementWriter.write(array, i) // nested writer never called
>   }
> }
> {code}
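
For readers unfamiliar with the N+1 rule described above, the following is a minimal pure-Python sketch of how list offsets are laid out. It is only an illustration: the helper names list_offsets and offset_buffer_bytes are hypothetical and are not part of the Arrow or Spark APIs, and the 4-byte width assumes the 32-bit offsets used by Arrow's regular list type.

```python
from typing import List, Sequence


def list_offsets(rows: Sequence[Sequence[object]]) -> List[int]:
    """Return Arrow-style list offsets for `rows` (hypothetical helper).

    The result always has len(rows) + 1 entries.
    """
    offsets = [0]  # the leading 0 is present even when there are no rows
    for row in rows:
        # Each entry marks where the next list ends in the flattened child data.
        offsets.append(offsets[-1] + len(row))
    return offsets


def offset_buffer_bytes(value_count: int, offset_width: int = 4) -> int:
    """Bytes needed for (N + 1) offsets, assuming 32-bit offsets by default."""
    return (value_count + 1) * offset_width


print(list_offsets([["a", "b"], [], ["c"]]))  # [0, 2, 2, 3]
print(list_offsets([]))                       # [0]
print(offset_buffer_bytes(0))                 # 4
```

Under this model an empty list column still needs a non-empty offset buffer holding the single entry 0, which is the case the description says is lost when the buffer size for a count of 0 comes back as 0.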