rishav394 commented on issue #4363:
URL: https://github.com/apache/arrow-adbc/issues/4363#issuecomment-4637888784
Correction on my previous comment. I dug deeper today and the crash is
actually in this repo itself and not PyArrow.
Please **do not transfer**.
### Root cause
Use-after-free in `_reader.pyx`. The `_import_from_c` method shallow-copies
the `ArrowArrayStream`, then passes the original to PyArrow. When PyArrow
rejects the format string, it releases the original stream. Then `check_error`
calls `AdbcErrorFromArrayStream` on the copy, which still has dangling pointers
to the now-freed state -> SIGSEGV.
```python
helper.c_stream = deref(c_stream)  # shallow copy (same pointers)
try:
    reader = pyarrow.RecordBatchReader._import_from_c(int(address))  # fails, releases stream
except Exception as e:
    helper.check_error(e)  # reads freed memory -> SIGSEGV
```
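To make the ownership problem concrete without needing a driver, here is a pure-Python model of the same sequence. It is only a sketch: `FakeStream`, `read_error_details`, and `run_model` are hypothetical names invented for illustration, not anything in `_reader.pyx`, and a raised `RuntimeError` stands in for the SIGSEGV.

```python
import copy


class FakeStream:
    """Hypothetical pure-Python stand-in for a C ArrowArrayStream struct."""

    def __init__(self, state):
        self.private_data = state  # producer-owned state, shared by reference
        self.released = False      # models the struct's release field being cleared

    def release(self):
        # Models the producer's release callback freeing its private state.
        self.private_data["freed"] = True
        self.private_data = None
        self.released = True


def read_error_details(stream):
    # Models a helper that reads producer state through the struct's pointers.
    state = stream.private_data
    if state is None or state["freed"]:
        raise RuntimeError("use-after-free: producer state already released")
    return state["error"]


def run_model():
    original = FakeStream({"freed": False, "error": "driver error details"})
    helper_copy = copy.copy(original)  # shallow copy: same private_data object

    original.release()  # the consumer rejects the schema and releases the stream

    # The copy was never updated, so it still looks live...
    looks_live = not helper_copy.released
    # ...but the state it points at is gone.
    try:
        read_error_details(helper_copy)
        outcome = "ok"
    except RuntimeError as exc:
        outcome = str(exc)
    return looks_live, outcome


print(run_model())
```

The point of the model is that releasing the original cannot invalidate the copy: the copy's own fields are untouched, so nothing in the copied struct signals that the producer state behind it has been freed.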
### Setup
```bash
docker run -d -p 8090:8080 trinodb/trino:latest
pip install adbc-driver-manager==1.8.0 pyarrow==14.0.2 dbc
dbc install trino
```
Any Go-based ADBC driver works (Trino, BigQuery, MySQL, etc.) - the bug is
in the driver manager, not the driver. Using Trino here because it's the
easiest to run locally.
```python
# common setup for all repros below
import faulthandler
faulthandler.enable()

import pyarrow as pa
from adbc_driver_manager import _lib, _reader


def get_decimal_stream():
    db = _lib.AdbcDatabase(driver="trino", uri="http://test@localhost:8090/memory/default")
    conn = _lib.AdbcConnection(db)
    stmt = _lib.AdbcStatement(conn)
    stmt.set_sql_query("SELECT CAST(1 AS DECIMAL(5,2))")
    handle, _ = stmt.execute_query()
    return handle
```
### Repro 1: PyArrow does NOT crash
<details>
<summary>pyarrow.RecordBatchReader._import_from_c raises a clean exception</summary>
```python
handle = get_decimal_stream()
try:
    pa.RecordBatchReader._import_from_c(handle.address)
except Exception as e:
    print(f"clean error: {e}")
```
```
clean error: Invalid or unsupported format string: 'd:5,2,32'
```
</details>
### Repro 2: arrow-adbc's wrapper crashes
<details>
<summary>AdbcRecordBatchReader._import_from_c calls check_error internally -> SIGSEGV</summary>
```python
handle = get_decimal_stream()
_reader.AdbcRecordBatchReader._import_from_c(handle.address)
```
```
Fatal Python error: Segmentation fault
```
</details>
### Repro 3: check_error is the crash site
<details>
<summary>Manually calling check_error on the freed stream reproduces the crash</summary>
```python
import ctypes

handle = get_decimal_stream()
helper = _reader._AdbcErrorHelper.__new__(_reader._AdbcErrorHelper)
# shallow copy, same as _reader.pyx
ctypes.memmove(id(helper) + 16, handle.address, 40)
try:
    pa.RecordBatchReader._import_from_c(handle.address)  # releases stream
except Exception as e:
    print(f"_import_from_c: clean error -> {e}", flush=True)
    print("calling check_error on freed stream ...", flush=True)
    helper.check_error(e)  # SIGSEGV
```
```
_import_from_c: clean error -> Invalid or unsupported format string: 'd:5,2,32'
calling check_error on freed stream ...
Fatal Python error: Segmentation fault
```
</details>
### Trigger
Any Go-based ADBC driver returning a type PyArrow doesn't recognize. Most
common today: DECIMAL with precision <= 18, where `driverbase-go` uses
`NarrowestDecimalType()` producing Decimal32/64 (format `d:p,s,32` or
`d:p,s,64`), unsupported by PyArrow < 15.
Not Decimal-specific - any unrecognized format string triggers the same
use-after-free in `check_error`.
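For anyone checking whether a given schema would hit this path, the format strings above follow the Arrow C data interface convention: `d:precision,scale` means Decimal128, and an optional third field gives the bit width. A minimal sketch of a parser for them follows; `parse_decimal_format` and `is_narrow_decimal` are hypothetical helper names, not part of PyArrow or the driver manager.

```python
def parse_decimal_format(fmt):
    """Return (precision, scale, bit_width) for an Arrow decimal format
    string such as 'd:5,2,32', or None if fmt is not a decimal format."""
    if not fmt.startswith("d:"):
        return None
    parts = fmt[2:].split(",")
    if len(parts) not in (2, 3):
        return None
    try:
        numbers = [int(part) for part in parts]
    except ValueError:
        return None
    precision, scale = numbers[0], numbers[1]
    # Per the C data interface, the bit width defaults to 128 when omitted.
    bit_width = numbers[2] if len(numbers) == 3 else 128
    return precision, scale, bit_width


def is_narrow_decimal(fmt):
    """True for Decimal32/Decimal64 format strings, the case described above."""
    parsed = parse_decimal_format(fmt)
    return parsed is not None and parsed[2] in (32, 64)


print(parse_decimal_format("d:5,2,32"))
print(is_narrow_decimal("d:19,10"))
```

This only classifies format strings; it does not work around the crash, since any format string the consumer rejects leads to the same `check_error` call.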