[PR] [SPARK-55056][SQL][PYTHON] Fix toPandas() SIGSEGV on nested empty arrays [spark]

via GitHub Thu, 15 Jan 2026 13:55:12 -0800


Yicong-Huang opened a new pull request, #53822:
URL: https://github.com/apache/spark/pull/53822


   ### What changes were proposed in this pull request?
   
   Fix a SIGSEGV in `toPandas()` when a DataFrame contains triple-nested empty 
arrays (e.g., a column of type `Array<Array<Array<String>>>`).
   
   Modified `ArrayWriter.finish()` to simulate one empty write when `count == 
0`, ensuring that the Arrow ListArray offset buffer is properly initialized.
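
   The idea can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `ToyListWriter`, `finish_unpatched`, and `finish_patched` are hypothetical names, the lazily allocated offset list is an assumption made for illustration, and none of this is Spark's `ArrayWriter` or Arrow's vector API.

```python
class ToyListWriter:
    """Toy stand-in for a list writer (hypothetical, not Spark's ArrayWriter)."""

    def __init__(self):
        self.count = 0
        self.offsets = []  # assumed to be allocated lazily, on the first write

    def write(self, values):
        # The first write allocates the leading 0 offset.
        if not self.offsets:
            self.offsets.append(0)
        # Each row appends the running end position of its values.
        self.offsets.append(self.offsets[-1] + len(values))
        self.count += 1

    def finish_unpatched(self):
        # With zero writes the offsets were never allocated: 0 entries.
        return list(self.offsets)

    def finish_patched(self):
        if self.count == 0:
            # Hypothetical equivalent of "simulate one empty write":
            # make sure the leading 0 offset exists without adding a row.
            self.offsets = [0]
        return list(self.offsets)
```

   In this model an untouched writer finishes with no offsets at all in the unpatched path, while the patched path always yields at least the single leading offset.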
   
   ### Why are the changes needed?
   
   Arrow requires a ListArray offset buffer to have N+1 entries (even when N=0, 
the buffer must contain `[0]`). When the outer array is empty, the nested 
`ArrayWriter`s are never invoked, leaving `count=0`. `getBufferSizeFor(0)` then 
returns 0, so the offset buffer is omitted, which violates the Arrow spec and 
causes the SIGSEGV.
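
   The N+1 rule can be checked with a few lines of plain Python. This is a minimal sketch rather than Arrow code: `list_offsets` and `offset_buffer` are hypothetical helper names, and the buffer layout assumes the little-endian 32-bit offsets used by Arrow's regular (non-large) list type.

```python
import struct


def list_offsets(rows):
    """Return the N+1 offsets for N list rows (hypothetical helper)."""
    offsets = [0]  # the leading 0 is present even when there are no rows
    for row in rows:
        offsets.append(offsets[-1] + len(row))
    return offsets


def offset_buffer(rows):
    """Pack the offsets as little-endian int32 values (assumed layout)."""
    offsets = list_offsets(rows)
    return struct.pack(f"<{len(offsets)}i", *offsets)


empty = []                   # N = 0 rows
two_rows = [["a", "b"], []]  # N = 2 rows, the second one empty

assert list_offsets(empty) == [0]
assert len(offset_buffer(empty)) == 4       # one int32, not zero bytes
assert list_offsets(two_rows) == [0, 2, 2]
assert len(offset_buffer(two_rows)) == 12   # three int32 values
```

   The zero-row case is the one described above: a correct buffer still holds 4 bytes, whereas a size computed as 0 leaves nothing for a reader to dereference.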
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. `toPandas()` on triple-nested empty arrays no longer crashes.
   
   ### How was this patch tested?
   
   - 2 Scala unit tests in `ArrowWriterSuite`
   - 4 Python integration tests in `test_arrow.py`
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


