andhus commented on issue #44696:
URL: https://github.com/apache/arrow/issues/44696#issuecomment-3532192904
Thanks for the amazing work on pyarrow 🙏
I have a similar issue on Python 3.14 with pyarrow 22.0.0 (which Python 3.14
requires). I would like to avoid conda. Dropping a summary of my situation and
a (hacky) patch to dodge the double registration.
DISCLAIMER/I am aware of:
> Having multiple versions of libarrow in play seems like something we'd
want to avoid in any case; more subtle errors could arise than this KeyError
with filesystem registration. Maybe we should try to assert this at runtime?
so this is just a workaround that seems to work for me so far.
Maybe it can be of help to someone - or let me know if it is
misleading/strongly discouraged 😅
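On the "assert this at runtime" idea from the quote: a minimal sketch of what such a check could look like, assuming you already have the list of shared libraries loaded into the process. How to obtain that list is platform-specific and left out here (on Linux, for example, it can be parsed from `/proc/self/maps`). The helper name `find_libarrow_copies` and the sample paths are made up for illustration; this is not something pyarrow provides.

```python
import os
import re

# File names such as libarrow.so.2200, libarrow.2200.dylib or arrow.dll.
# Companion libraries (libarrow_python, libarrow_dataset, ...) are ignored.
_LIBARROW_NAME = re.compile(r"^(lib)?arrow(\.\d+)*\.(so|dylib|dll)(\.\d+)*$")


def find_libarrow_copies(loaded_paths):
    """Return the distinct paths in `loaded_paths` that look like libarrow."""
    copies = {
        os.path.normpath(p)
        for p in loaded_paths
        if _LIBARROW_NAME.match(os.path.basename(p))
    }
    return sorted(copies)


# Made-up paths for illustration only.
sample = [
    "/opt/homebrew/lib/libarrow.2200.dylib",
    "/venv/lib/python3.14/site-packages/pyarrow/libarrow.2200.dylib",
    "/venv/lib/python3.14/site-packages/pyarrow/libarrow_python.dylib",
    "/usr/lib/libSystem.B.dylib",
]
copies = find_libarrow_copies(sample)
if len(copies) > 1:
    print("More than one libarrow in this process:", *copies, sep="\n  ")
```

More than one hit would only be a hint that two copies are in play; it does not prove that they conflict.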
## Environment
```bash
# System
$ sw_vers
ProductName: macOS
ProductVersion: 15.6
BuildVersion: 24G84
# Python
$ python --version
Python 3.14.0
# GDAL
$ gdal-config --version
3.11.5
# Python packages
$ pip list | grep -E "(pyarrow|rasterio|geopandas)"
geopandas 1.1.1
pyarrow 22.0.0
rasterio 1.4.3
```
## Issue
When using PyArrow 22.0.0 alongside GDAL-based libraries (rasterio, fiona,
pyogrio) on Python 3.14, we encounter:
```python
ArrowKeyError: Attempted to register factory for scheme 'file' but that scheme is already registered.
```
### Minimal Reproduction
```python
# This fails:
import rasterio # or fiona, or pyogrio
from pyarrow import fs
local_fs = fs.LocalFileSystem() # ArrowKeyError
# This also fails:
from pyarrow import fs
local_fs1 = fs.LocalFileSystem() # OK
import rasterio
local_fs2 = fs.LocalFileSystem() # ArrowKeyError
```
The error occurs when:
1. A GDAL-based library is imported (registers the `file://` scheme)
2. PyArrow tries to create a `LocalFileSystem` (also tries to register the
`file://` scheme)
Or vice versa - whichever tries to register second fails.
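To make the "second registration fails" behaviour concrete, here is a toy model in plain Python. It only mimics the behaviour as I describe it above; `SchemeRegistry` is a made-up class and not Arrow's actual implementation.

```python
class SchemeRegistry:
    """Toy stand-in for a process-wide URI scheme registry (not Arrow's code)."""

    def __init__(self):
        self._factories = {}

    def register(self, scheme, factory):
        # Refuse a second registration for the same scheme.
        if scheme in self._factories:
            raise KeyError(
                f"Attempted to register factory for scheme '{scheme}' "
                "but that scheme is already registered."
            )
        self._factories[scheme] = factory


registry = SchemeRegistry()

# First registrant wins, whichever library it is.
registry.register("file", lambda: "local filesystem from library A")

# Second registrant hits the error, regardless of import order.
try:
    registry.register("file", lambda: "local filesystem from library B")
except KeyError as exc:
    second_attempt_error = exc.args[0]
else:
    second_attempt_error = None

print(second_attempt_error)
```

This is why the workaround further down tries to make sure the registering code path runs only once.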
## Hacky patch solution:
```python
"""Patch for PyArrow 22.0.0 + GDAL compatibility issue.

This patches PyArrow's `_resolve_filesystem_and_path` function to reuse a cached
LocalFileSystem instance instead of creating new ones, which avoids the
"Attempted to register factory for scheme 'file' but that scheme is already
registered" error when using PyArrow alongside GDAL-based libraries (rasterio,
fiona, pyogrio).

Root cause: PyArrow (from PyPI) and GDAL (from Homebrew on macOS) each bundle
their own libarrow library, and both try to register the 'file://' URI scheme
handler.

Usage:
    # At the top of your script or python session:
    from <my_package>.pyarrow_gdal_patch import apply_patch
    apply_patch()

It is also recommended to import and apply the patch in your package's root
`__init__.py` file to ensure it is applied before any other imports:

    # Auto-apply PyArrow/GDAL compatibility patch. This must be imported before
    # any other modules that use PyArrow or GDAL to avoid "Attempted to register
    # factory for scheme 'file'" errors. See `pyarrow_gdal_patch.py` for details.
    from <my_package>.pyarrow_gdal_patch import apply_patch
    apply_patch()
"""

import pyarrow.fs as arrow_fs
from pyarrow._fs import FileSystem, LocalFileSystem

# Cache LocalFileSystem instances for both memory_map settings.
# We create both upfront before any GDAL imports to avoid registration conflicts.
_cached_local_fs_no_mmap: LocalFileSystem | None = None
_cached_local_fs_with_mmap: LocalFileSystem | None = None

# Store original function (in case we need to unpatch)
_original_resolve = arrow_fs._resolve_filesystem_and_path  # type: ignore


def _patched_resolve_filesystem_and_path(
    path,  # type: ignore
    filesystem=None,  # type: ignore
    *,
    memory_map=False,  # type: ignore
):
    """Patched version of PyArrow's _resolve_filesystem_and_path that reuses
    cached LocalFileSystem instances to avoid re-registration conflicts.

    We maintain two cached instances (one for memory_map=True, one for False)
    that are created upfront before any GDAL imports. This allows us to respect
    the memory_map parameter without triggering the 'file://' scheme
    registration conflict with GDAL.
    """
    global _cached_local_fs_no_mmap, _cached_local_fs_with_mmap

    # Original logic from pyarrow/fs.py
    if not arrow_fs._is_path_like(path):  # type: ignore
        if filesystem is not None:
            raise ValueError(
                "'filesystem' passed but the specified path is file-like, so"
                " there is nothing to open with 'filesystem'."
            )
        return filesystem, path

    if filesystem is not None:
        filesystem = arrow_fs._ensure_filesystem(  # type: ignore
            filesystem, use_mmap=memory_map
        )
        if isinstance(filesystem, LocalFileSystem):
            path = arrow_fs._stringify_path(path)  # type: ignore
        elif not isinstance(path, str):
            raise TypeError(
                "Expected string path; path-like objects are only allowed "
                "with a local filesystem"
            )
        path = filesystem.normalize_path(path)  # type: ignore
        return filesystem, path

    path = arrow_fs._stringify_path(path)  # type: ignore

    # PATCH: Reuse cached LocalFileSystem instead of creating a new one.
    # This is the key change - line 160 in the original creates:
    #     LocalFileSystem(use_mmap=memory_map)
    # We use the appropriate cached instance based on memory_map parameter.
    if memory_map:
        if _cached_local_fs_with_mmap is None:
            _cached_local_fs_with_mmap = LocalFileSystem(use_mmap=True)
        filesystem = _cached_local_fs_with_mmap
    else:
        if _cached_local_fs_no_mmap is None:
            _cached_local_fs_no_mmap = LocalFileSystem(use_mmap=False)
        filesystem = _cached_local_fs_no_mmap

    try:
        file_info = filesystem.get_file_info(path)  # type: ignore
    except ValueError:
        file_info = None
        exists_locally = False
    else:
        exists_locally = file_info.type != arrow_fs.FileType.NotFound  # type: ignore

    # If file doesn't exist locally, try parsing as URI
    if not exists_locally:
        try:
            filesystem, path = FileSystem.from_uri(path)  # type: ignore
        except ValueError as e:
            msg = str(e)
            if "empty scheme" in msg or "Cannot parse URI" in msg:
                # Neither URI nor locally existing path - will propagate
                # file not found
                pass
            else:
                raise e
    else:
        path = filesystem.normalize_path(path)  # type: ignore

    return filesystem, path


def apply_patch():
    """Apply the PyArrow/GDAL compatibility patch.

    This creates both LocalFileSystem instances (memory_map=True and False)
    upfront before any GDAL imports. This allows us to respect the memory_map
    parameter while avoiding the 'file://' scheme registration conflict.
    """
    global _cached_local_fs_no_mmap, _cached_local_fs_with_mmap

    # Replace the function
    arrow_fs._resolve_filesystem_and_path = _patched_resolve_filesystem_and_path

    # Initialize BOTH cached LocalFileSystem instances before any GDAL imports.
    # This must happen before GDAL imports to claim the 'file://' scheme.
    from pyarrow import fs

    if _cached_local_fs_no_mmap is None:
        _cached_local_fs_no_mmap = fs.LocalFileSystem(use_mmap=False)  # type: ignore
    if _cached_local_fs_with_mmap is None:
        _cached_local_fs_with_mmap = fs.LocalFileSystem(use_mmap=True)  # type: ignore
```
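The module keeps `_original_resolve` around "in case we need to unpatch" but does not show the unpatching itself. A minimal sketch of one way to do a reversible swap, using a generic context manager from the standard library. The helper `swapped_attr` and the stand-in object `fake_fs_module` are hypothetical names for illustration; with the real patch the owner would be `pyarrow.fs` and the attribute `_resolve_filesystem_and_path`.

```python
import contextlib
import types


@contextlib.contextmanager
def swapped_attr(owner, name, replacement):
    """Temporarily replace `owner.<name>` and restore the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        # Restore even if the body raised.
        setattr(owner, name, original)


# Stand-in for a module with a private function we want to patch.
fake_fs_module = types.SimpleNamespace(
    _resolve=lambda path: ("original", path)
)

with swapped_attr(fake_fs_module, "_resolve", lambda path: ("patched", path)):
    inside = fake_fs_module._resolve("data.parquet")

outside = fake_fs_module._resolve("data.parquet")
print(inside, outside)
```

For this particular workaround a permanent patch applied at import time is probably what you want; the scoped variant is mainly handy in tests.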