Re: [I] _import_from_c segfaults on unrecognized Arrow format strings (e.g. Decimal32/64 on PyArrow < 15) [arrow-adbc]

via GitHub Sat, 06 Jun 2026 01:00:50 -0700


rishav394 commented on issue #4363:
URL: https://github.com/apache/arrow-adbc/issues/4363#issuecomment-4637888784


   Correction to my previous comment: I dug deeper today, and the crash is 
actually in this repo, not in PyArrow.
   Please **do not transfer**.
   
   ### Root cause
   
   Use-after-free in `_reader.pyx`. The `_import_from_c` method shallow-copies 
the `ArrowArrayStream`, then passes the original to PyArrow. When PyArrow 
rejects the format string, it releases the original stream. Then `check_error` 
calls `AdbcErrorFromArrayStream` on the copy, which still has dangling pointers 
to the now-freed state -> SIGSEGV.
   
   ```python
   helper.c_stream = deref(c_stream)  # shallow copy (same pointers)
   try:
       reader = pyarrow.RecordBatchReader._import_from_c(int(address))  # fails, releases stream
   except Exception as e:
       helper.check_error(e)  # reads freed memory -> SIGSEGV
   ```
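
To make the aliasing concrete, here is a minimal sketch that mimics the layout of `ArrowArrayStream` with `ctypes` and shows why the shallow copy cannot tell that the stream was released. The pointer values are fake placeholders that are never dereferenced, and `simulate_consumer_release` is a hypothetical stand-in for what a consumer's release call does according to the Arrow C stream interface (the producer frees its private state and sets `release` to NULL in the struct it was handed). It is not the code in `_reader.pyx`.

```python
import ctypes


class ArrowArrayStream(ctypes.Structure):
    # Five pointer-sized members, mirroring the Arrow C stream interface.
    _fields_ = [
        ("get_schema", ctypes.c_void_p),
        ("get_next", ctypes.c_void_p),
        ("get_last_error", ctypes.c_void_p),
        ("release", ctypes.c_void_p),
        ("private_data", ctypes.c_void_p),
    ]


def simulate_consumer_release(stream):
    """Hypothetical stand-in for a release: only the struct that was
    passed in gets marked as released (release == NULL)."""
    stream.release = None


# Fake, never-dereferenced addresses standing in for real callbacks/state.
original = ArrowArrayStream(0x1000, 0x2000, 0x3000, 0x4000, 0x5000)

# Shallow copy: same bytes, so the same pointers.
shallow = ArrowArrayStream()
ctypes.memmove(ctypes.byref(shallow), ctypes.byref(original),
               ctypes.sizeof(ArrowArrayStream))

simulate_consumer_release(original)

print("original released:", original.release is None)
print("copy still looks live:", shallow.release is not None)
print("copy private_data:", hex(shallow.private_data))
```

In the sketch, the copy keeps a non-NULL `release` and the old `private_data` address after the original has been released, so any later call made through the copy would operate on state that no longer exists.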
   
   ### Setup
   
   ```bash
   docker run -d -p 8090:8080 trinodb/trino:latest
   pip install adbc-driver-manager==1.8.0 pyarrow==14.0.2 dbc
   dbc install trino
   ```
   
   Any Go-based ADBC driver works (Trino, BigQuery, MySQL, etc.) - the bug is 
in the driver manager, not the driver. Using Trino here because it's the 
easiest to run locally.
   
   ```python
   # common setup for all repros below
   import faulthandler
   faulthandler.enable()
   import pyarrow as pa
   from adbc_driver_manager import _lib, _reader
   
   def get_decimal_stream():
       db = _lib.AdbcDatabase(driver="trino",
                              uri="http://test@localhost:8090/memory/default")
       conn = _lib.AdbcConnection(db)
       stmt = _lib.AdbcStatement(conn)
       stmt.set_sql_query("SELECT CAST(1 AS DECIMAL(5,2))")
       handle, _ = stmt.execute_query()
       return handle
   ```
   
   ### Repro 1: PyArrow does NOT crash
   
   <details>
   <summary>pyarrow.RecordBatchReader._import_from_c raises a clean exception</summary>
   
   ```python
   handle = get_decimal_stream()
   try:
       pa.RecordBatchReader._import_from_c(handle.address)
   except Exception as e:
       print(f"clean error: {e}")
   ```
   
   ```
   clean error: Invalid or unsupported format string: 'd:5,2,32'
   ```
   </details>
   
   ### Repro 2: arrow-adbc's wrapper crashes
   
   <details>
   <summary>AdbcRecordBatchReader._import_from_c calls check_error internally -> SIGSEGV</summary>
   
   ```python
   handle = get_decimal_stream()
   _reader.AdbcRecordBatchReader._import_from_c(handle.address)
   ```
   
   ```
   Fatal Python error: Segmentation fault
   ```
   </details>
   
   ### Repro 3: check_error is the crash site
   
   <details>
   <summary>Manually calling check_error on the freed stream reproduces the crash</summary>
   
   ```python
   import ctypes
   
   handle = get_decimal_stream()
   helper = _reader._AdbcErrorHelper.__new__(_reader._AdbcErrorHelper)
   # shallow copy, same as _reader.pyx
   ctypes.memmove(id(helper) + 16, handle.address, 40)
   
   try:
       pa.RecordBatchReader._import_from_c(handle.address)  # releases stream
   except Exception as e:
       print(f"_import_from_c: clean error -> {e}", flush=True)
       print("calling check_error on freed stream ...", flush=True)
       helper.check_error(e)  # SIGSEGV
   ```
   
   ```
   _import_from_c: clean error -> Invalid or unsupported format string: 'd:5,2,32'
   calling check_error on freed stream ...
   Fatal Python error: Segmentation fault
   ```
   </details>
   
   ### Trigger
   
   Any Go-based ADBC driver returning a type PyArrow doesn't recognize. Most 
common today: DECIMAL with precision <= 18, where `driverbase-go` uses 
`NarrowestDecimalType()` producing Decimal32/64 (format `d:p,s,32` or 
`d:p,s,64`), unsupported by PyArrow < 15.
   
   Not Decimal-specific - any unrecognized format string triggers the same 
use-after-free in `check_error`.
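
For anyone who wants to check whether a given schema would hit the Decimal case, here is a small sketch of parsing Arrow decimal format strings. In the Arrow C data interface, `d:p,s` means a 128-bit decimal and `d:p,s,N` carries an explicit bit width. The helper names `parse_decimal_format` and `is_narrow_decimal` are hypothetical and not part of PyArrow or arrow-adbc.

```python
def parse_decimal_format(fmt):
    """Return (precision, scale, bit_width) for an Arrow decimal format
    string, or None if fmt is not a decimal format string."""
    if not fmt.startswith("d:"):
        return None
    parts = fmt[2:].split(",")
    if len(parts) not in (2, 3):
        raise ValueError(f"malformed decimal format string: {fmt!r}")
    precision, scale = int(parts[0]), int(parts[1])
    # Without an explicit third field the bit width defaults to 128.
    bit_width = int(parts[2]) if len(parts) == 3 else 128
    return precision, scale, bit_width


def is_narrow_decimal(fmt):
    """True for Decimal32/Decimal64 format strings such as 'd:5,2,32'."""
    parsed = parse_decimal_format(fmt)
    return parsed is not None and parsed[2] in (32, 64)


for fmt in ("d:5,2,32", "d:18,4,64", "d:38,10", "i"):
    print(fmt, parse_decimal_format(fmt), is_narrow_decimal(fmt))
```

This only classifies format strings; it does not avoid the crash, since any unrecognized format string takes the same path through `check_error`.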
   


