[ 
https://issues.apache.org/jira/browse/SPARK-55056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yicong Huang updated SPARK-55056:
---------------------------------
    Description: 
{{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array 
types (depth >= 3) with an empty outer array.
{code:python}
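from pyspark.sql import Row
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# `spark` is assumed to be an existing SparkSession.
schema = StructType([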
    StructField("data", ArrayType(StructType([
        StructField("arr", ArrayType(StructType([
            StructField("inner", ArrayType(StringType()))
        ])))
    ])))
])
df = spark.createDataFrame([Row(data=[])], schema=schema)
df.toPandas()  # SIGSEGV
{code}
The Arrow format requires a ListArray offset buffer to have N+1 entries. Even when 
N=0, the buffer must contain {{\[0\]}}. When the outer array is empty, nested 
{{ArrayWriter}}s are never invoked, so their {{count}} stays 0. Then 
{{getBufferSizeFor(0)}} returns 0, and the offset buffer is omitted in IPC 
serialization, which violates the Arrow spec.
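To make the N+1 rule concrete, here is a minimal pure-Python sketch of how list offsets are laid out. It does not use any Arrow library, and the helper name {{list_offsets}} is hypothetical, introduced only for illustration.

```python
def list_offsets(lists):
    """Return Arrow-style offsets for a sequence of lists (N+1 entries for N lists)."""
    offsets = [0]  # the leading 0 is present even when there are no lists
    for values in lists:
        # Each offset marks where the next list ends in the flattened child data.
        offsets.append(offsets[-1] + len(values))
    return offsets


print(list_offsets([["a", "b"], [], ["c"]]))  # [0, 2, 2, 3]
print(list_offsets([]))                       # [0]
```

With zero lists the result still holds one entry, which is the entry that goes missing in the failure described here.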
{code:scala}
// ArrayWriter.scala - current behavior
override def setValue(...): Unit = {
  var i = 0
  while (i < array.numElements()) {  // never runs when empty
    elementWriter.write(array, i)    // nested writer never called
    i += 1
  }
}
{code}
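The effect of the loop above on nested writers can be modeled with a small toy in Python. This is only a sketch under stated assumptions: {{ToyListWriter}} and its methods are hypothetical stand-ins rather than Spark's {{ArrayWriter}}, the struct levels of the schema are ignored, and the sizing rule in {{naive_offset_bytes}} is assumed from the description of {{getBufferSizeFor(0)}} given here.

```python
OFFSET_WIDTH = 4  # bytes per 32-bit list offset


class ToyListWriter:
    """Illustrative stand-in for a list writer; not Spark's ArrayWriter."""

    def __init__(self, element_writer=None):
        self.element_writer = element_writer
        self.count = 0  # number of list values written so far

    def write(self, array):
        # Children are only touched once per element, so an empty array
        # never reaches the nested writer.
        for element in array:
            if self.element_writer is not None:
                self.element_writer.write(element)
        self.count += 1

    def naive_offset_bytes(self):
        # Assumption taken from the report: a count of 0 is sized as 0 bytes.
        return 0 if self.count == 0 else (self.count + 1) * OFFSET_WIDTH

    def spec_offset_bytes(self):
        # Arrow layout: N+1 offsets, so one offset even when N == 0.
        return (self.count + 1) * OFFSET_WIDTH


inner = ToyListWriter()
middle = ToyListWriter(inner)
outer = ToyListWriter(middle)
outer.write([])  # one row whose outer array is empty

print(outer.count, middle.count, inner.count)                   # 1 0 0
print(middle.naive_offset_bytes(), middle.spec_offset_bytes())  # 0 4
```

In this model the nested writers report 0 bytes for their offset buffers, while the layout rule calls for 4 bytes holding the single offset 0.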

  was:
{{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array 
types (depth >= 3) with an empty outer array.

{code:python}
schema = StructType([
    StructField("data", ArrayType(StructType([
        StructField("arr", ArrayType(StructType([
            StructField("inner", ArrayType(StringType()))
        ])))
    ])))
])
df = spark.createDataFrame([Row(data=[])], schema=schema)
df.toPandas()  # SIGSEGV
{code}

Arrow format requires ListArray offset buffer to have N+1 entries. Even when 
N=0, the buffer must contain {{\[0\]}}. When the outer array is empty, nested 
{{ArrayWriter}}s are never invoked, so their {{count}} stays 0. Then 
{{getBufferSizeFor(0)}} returns 0, and the offset buffer is omitted in IPC 
serialization — violating Arrow spec.

{code:scala}
// ArrayWriter.scala - current behavior
override def setValue(...): Unit = {
  while (i < array.numElements()) {  // never runs when empty
    elementWriter.write(array, i)    // nested writer never called
  }
}
{code}


> toPandas() crashes with SIGSEGV on nested empty arrays
> ------------------------------------------------------
>
>                 Key: SPARK-55056
>                 URL: https://issues.apache.org/jira/browse/SPARK-55056
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Critical
>
> {{toPandas()}} crashes with SIGSEGV when a DataFrame contains nested array 
> types (depth >= 3) with an empty outer array.
> {code:python}
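> from pyspark.sql import Row
> from pyspark.sql.types import ArrayType, StringType, StructField, StructType
>
> # `spark` is assumed to be an existing SparkSession.
> schema = StructType([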
>     StructField("data", ArrayType(StructType([
>         StructField("arr", ArrayType(StructType([
>             StructField("inner", ArrayType(StringType()))
>         ])))
>     ])))
> ])
> df = spark.createDataFrame([Row(data=[])], schema=schema)
> df.toPandas()  # SIGSEGV
> {code}
> The Arrow format requires a ListArray offset buffer to have N+1 entries. Even when 
> N=0, the buffer must contain {{\[0\]}}. When the outer array is empty, 
> nested {{ArrayWriter}}s are never invoked, so their {{count}} stays 0. Then 
> {{getBufferSizeFor(0)}} returns 0, and the offset buffer is omitted in IPC 
> serialization, which violates the Arrow spec.
> {code:scala}
> // ArrayWriter.scala - current behavior
> override def setValue(...): Unit = {
>   var i = 0
>   while (i < array.numElements()) {  // never runs when empty
>     elementWriter.write(array, i)    // nested writer never called
>     i += 1
>   }
> }
> {code}


