[ 
https://issues.apache.org/jira/browse/SPARK-55056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-55056:
-----------------------------------
    Labels: pull-request-available  (was: )

> toPandas() crashes with SIGSEGV on nested empty arrays
> ------------------------------------------------------
>
>                 Key: SPARK-55056
>                 URL: https://issues.apache.org/jira/browse/SPARK-55056
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Critical
>              Labels: pull-request-available
>
> {{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array 
> types (depth >= 3) with an empty outer array.
> {code:python}
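> from pyspark.sql import Row
> from pyspark.sql.types import ArrayType, StringType, StructField, StructType
>
> # Assumes an active SparkSession named `spark`.
> schema = StructType([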
>     StructField("data", ArrayType(StructType([
>         StructField("arr", ArrayType(StructType([
>             StructField("inner", ArrayType(StringType()))
>         ])))
>     ])))
> ])
> df = spark.createDataFrame([Row(data=[])], schema=schema)
> df.toPandas()  # SIGSEGV
> {code}
> The Arrow format requires a ListArray offset buffer to have N+1 entries. Even
> when N=0, the buffer must contain {{[0]}}. When the outer array is empty, the
> nested {{ArrayWriter}}s are never invoked, so their {{count}} stays 0. Then
> {{getBufferSizeFor(0)}} returns 0, and the offset buffer is omitted in IPC
> serialization, violating the Arrow spec.
> {code:scala}
> // ArrayWriter.scala - current behavior (simplified)
> override def setValue(...): Unit = {
>   var i = 0
>   while (i < array.numElements()) {  // never runs when empty
>     elementWriter.write(array, i)    // nested writer never called
>     i += 1
>   }
> }
> {code}
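
As a rough illustration of the N+1 offset rule described above, the following pure-Python sketch builds the offsets for a sequence of variable-length lists and packs them as int32 values. It does not use Arrow or Spark; `list_offsets` and `encode_offsets` are hypothetical helper names, and the 4-byte offset width assumes a regular (non-large) list array.

```python
import struct

OFFSET_WIDTH = 4  # bytes per int32 offset (assumption: non-large list array)


def list_offsets(lists):
    """Return the N+1 offsets describing N variable-length lists."""
    offsets = [0]
    for values in lists:
        offsets.append(offsets[-1] + len(values))
    return offsets


def encode_offsets(offsets):
    """Pack offsets as little-endian int32 values."""
    return struct.pack(f"<{len(offsets)}i", *offsets)


# Three lists: ["a", "b"], [], ["c"] give four offsets.
assert list_offsets([["a", "b"], [], ["c"]]) == [0, 2, 2, 3]

# Zero lists still yield one offset, so the buffer is 4 bytes, not 0.
empty = list_offsets([])
assert empty == [0]
assert len(encode_offsets(empty)) == OFFSET_WIDTH
```

In this model, a writer that reports a zero-byte offset buffer for N=0 drops the single `[0]` entry, which is the condition the description identifies as the cause of the crash.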



