timsaucer opened a new issue, #1607: URL: https://github.com/apache/datafusion-python/issues/1607

## Describe the bug

After the DataFusion 54 upgrade (#1562), importing `datafusion` and performing any Arrow-backed operation segfaults (SIGSEGV) when the installed PyArrow is 24.0.0 on macOS (arm64). The crash happens on the very first Arrow allocation made through the bindings, for example when building a literal `lit(pa.scalar(0, type=pa.int32()))`. That is exactly what `python/datafusion/functions/spark.py` does at module import, so even a bare `import datafusion` crashes.

This is a regression introduced on the 54 upgrade branch; it does **not** affect the released `datafusion-python` 53.0.0.

## Symptoms

`import datafusion` (or any operation that constructs an Arrow value) terminates the process with `Segmentation fault: 11` (exit code 139). The native crash report points into PyArrow's own bundled mimalloc, not into our code:

```
mi_theap_malloc_zero_aligned_at_overalloc   <- SIGSEGV (mimalloc v3 thread-heap)
mi_theap_realloc_zero_aligned_at
arrow::MimallocAllocator::ReallocateAligned
arrow::PoolBuffer::Resize
arrow::NumericBuilder<Int32Type>::FinishInternal
arrow::py::ConvertPySequence
__pyx_pw_7pyarrow_3lib_191scalar            <- pa.scalar(0, type=int32())
```

## Root cause

There are two independent mimalloc runtimes in the process:

- `datafusion-python` installs mimalloc as the Rust `#[global_allocator]` (`crates/core/src/lib.rs`, enabled by the default `mimalloc` feature).
- PyArrow 24 ships and defaults to its own bundled mimalloc memory pool.

The DataFusion 54 dependency bump moved `libmimalloc-sys` 0.1.44 -> 0.1.49 (the `mimalloc` crate 0.1.48 -> 0.1.52), which changed the bundled allocator from mimalloc **v2** to mimalloc **v3**. PyArrow 24 also bundles mimalloc **v3**. Two mimalloc-v3 runtimes collide at the macOS process-global level (malloc-zone / thread-local-heap initialization), corrupting each other's thread heap and faulting on the first allocation.
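
To confirm which allocator PyArrow has selected as its default memory pool in a given environment, a minimal diagnostic sketch follows. `default_memory_pool()` and its `backend_name` attribute are part of PyArrow's public API; the helper name `pyarrow_pool_backend` is hypothetical, and it returns `None` when PyArrow is not installed. It deliberately does not import `datafusion`, so it should be safe to run in an affected environment.

```python
import importlib.util
from typing import Optional


def pyarrow_pool_backend() -> Optional[str]:
    """Return the backend name of PyArrow's default memory pool.

    Returns None when PyArrow is not importable in this environment.
    The helper name is hypothetical and used for illustration only.
    """
    if importlib.util.find_spec("pyarrow") is None:
        return None
    import pyarrow as pa

    # backend_name is a short string such as "mimalloc", "jemalloc" or "system".
    return pa.default_memory_pool().backend_name


if __name__ == "__main__":
    print("PyArrow default memory pool backend:", pyarrow_pool_backend())
```

A result of `"mimalloc"` indicates that PyArrow's own mimalloc pool is active, which is the configuration described above.
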

The 53.0.0 release shipped mimalloc **v2** (`libmimalloc-sys` 0.1.44), which coexists fine with PyArrow's v3 pool. That is why no released version is affected.

## Affected versions / platforms

- **PyArrow**: 24.0.0 triggers it. PyArrow 20.0.0 through 23.0.1 are unaffected (verified against the 54-branch build).
- **datafusion-python**: the in-progress 54 upgrade branch. Released 53.0.0 is **not** affected (verified with PyArrow 20–24).
- **Platforms**: confirmed on macOS arm64. Linux is expected to be unaffected because PyArrow defaults to jemalloc there (only one mimalloc in the process). Windows defaults to mimalloc like macOS, so it is potentially affected, but the macOS-specific malloc-zone vector may not apply; this needs verification in CI.

## Reproduction

On macOS arm64 with a 54-branch build of `datafusion-python` and `pyarrow==24.0.0`:

```python
import datafusion  # segfaults here (spark.py builds an int32 literal at import)
```

or, isolating the allocation:

```python
import pyarrow as pa
from datafusion import lit

lit(pa.scalar(0, type=pa.int32()))  # SIGSEGV
```

## Suggested fix

Pin the bundled allocator to the mimalloc v2 line so two mimalloc-v3 runtimes never coexist. `libmimalloc-sys` (and the `mimalloc` crate) expose a `v2` feature for this; adding it to the `mimalloc` feature list in `crates/core/Cargo.toml` keeps the Rust global allocator (no performance loss, no PyArrow pin) and resolves the crash. This has been verified locally: with the `v2` feature the 54-branch build runs cleanly against PyArrow 24.0.0.

A longer-term fix should investigate making two mimalloc-v3 instances coexist (or platform-gating the allocator), and we should add a CI smoke test that imports `datafusion` and constructs an Arrow literal against the newest PyArrow on macOS so this regression cannot return silently.

## Acceptance / testing

The fix must include test coverage: a smoke test (run on macOS, and ideally Windows) that imports `datafusion` and builds an Arrow-backed literal under the newest supported PyArrow, asserting no crash.
