andhus commented on issue #44696:
URL: https://github.com/apache/arrow/issues/44696#issuecomment-3532192904

   Thanks for the amazing work on pyarrow 🙏 
   
   I have a similar issue on Python 3.14 with pyarrow 22.0.0 (which Python 
3.14 requires). I would like to avoid conda. Dropping a summary of my situation 
and a (hacky) patch to dodge the double registration.
   
   DISCLAIMER: I am aware of:
   
   > Having multiple versions of libarrow in play seems like something we'd 
want to avoid in any case; more subtle errors could arise than this KeyError 
with filesystem registration. Maybe we should try to assert this at runtime?
   
   so this is just a workaround that seems to work for me so far.
   
   Maybe it can be of help to someone - or let me know if it is 
misleading/strongly discouraged 😅 
   
   ## Environment
   
   ```bash
   # System
   $ sw_vers
   ProductName:         macOS
   ProductVersion:              15.6
   BuildVersion:                24G84
   
   # Python
   $ python --version
   Python 3.14.0
   
   # GDAL
   $ gdal-config --version
   3.11.5
   
   # Python packages
   $ pip list | grep -E "(pyarrow|rasterio|geopandas)"
   geopandas    1.1.1
   pyarrow      22.0.0
   rasterio     1.4.3
   ```
   
   ## Issue
   
   When using PyArrow 22.0.0 alongside GDAL-based libraries (rasterio, fiona, 
pyogrio) on Python 3.14, we encounter:
   
   ```python
   ArrowKeyError: Attempted to register factory for scheme 'file' but that 
scheme is already registered.
   ```
   
   ### Minimal Reproduction
   
   ```python
   # This fails:
   import rasterio  # or fiona, or pyogrio
   from pyarrow import fs
   local_fs = fs.LocalFileSystem()  # ArrowKeyError
   
   # This also fails:
   from pyarrow import fs
   local_fs1 = fs.LocalFileSystem()  # OK
   import rasterio
   local_fs2 = fs.LocalFileSystem()  # ArrowKeyError
   ```
   
   The error occurs when:
   1. A GDAL-based library is imported (registers the `file://` scheme)
   2. PyArrow tries to create a `LocalFileSystem` (also tries to register the 
`file://` scheme)
   
   Or vice versa: whichever tries to register second fails.
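
To make the "whichever registers second fails" behaviour concrete, here is a toy model in plain Python. It is not Arrow's implementation; it only assumes, as described above, a process-wide registry that rejects a second factory for the same scheme. `SchemeRegistry`, `register_factory`, and `DuplicateSchemeError` are hypothetical names invented for this sketch.

```python
class DuplicateSchemeError(KeyError):
    """Hypothetical stand-in for ArrowKeyError."""


class SchemeRegistry:
    """Toy process-wide registry: at most one factory per URI scheme."""

    def __init__(self):
        self._factories = {}

    def register_factory(self, scheme, factory):
        # Refuse to overwrite an existing registration.
        if scheme in self._factories:
            raise DuplicateSchemeError(
                f"Attempted to register factory for scheme '{scheme}' "
                "but that scheme is already registered."
            )
        self._factories[scheme] = factory


registry = SchemeRegistry()

# The first library to load claims the scheme.
registry.register_factory("file", lambda: "local filesystem from library A")

# A second copy of the library tries to claim the same scheme and fails.
try:
    registry.register_factory("file", lambda: "local filesystem from library B")
except DuplicateSchemeError as exc:
    second_attempt_error = str(exc)
else:
    second_attempt_error = None
```

In this model the order of the two `register_factory` calls does not matter: the second one always raises, which matches both variants of the reproduction above.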
   
   ## Hacky patch solution:
   
   ```python
   """Patch for PyArrow 22.0.0 + GDAL compatibility issue.
   
   This patches PyArrow's `_resolve_filesystem_and_path` function to reuse a
   cached LocalFileSystem instance instead of creating new ones, which avoids
   the "Attempted to register factory for scheme 'file' but that scheme is
   already registered" error when using PyArrow alongside GDAL-based libraries
   (rasterio, fiona, pyogrio).
   
   Root cause: PyArrow (from PyPI) and GDAL (from Homebrew on macOS) each
   bundle their own libarrow library, and both try to register the 'file://'
   URI scheme handler.
   
   Usage:
   # At the top of your script or python session:
   
   from <my_package>.pyarrow_gdal_patch import apply_patch
   
   apply_patch()
   
   
   It is also recommended to import and apply the patch in your package's root
   `__init__.py` file to ensure it is applied before any other imports:
   
   # Auto-apply PyArrow/GDAL compatibility patch. This must be imported before
   # any other modules that use PyArrow or GDAL to avoid "Attempted to register
   # factory for scheme 'file'" errors. See `pyarrow_gdal_patch.py` for details.
   
   from <my_package>.pyarrow_gdal_patch import apply_patch
   
   apply_patch()
   """
   
   import pyarrow.fs as arrow_fs
   from pyarrow._fs import FileSystem, LocalFileSystem
   
   # Cache LocalFileSystem instances for both memory_map settings.
   # We create both upfront before any GDAL imports to avoid registration
   # conflicts.
   _cached_local_fs_no_mmap: LocalFileSystem | None = None
   _cached_local_fs_with_mmap: LocalFileSystem | None = None
   
   # Store original function (in case we need to unpatch)
   _original_resolve = arrow_fs._resolve_filesystem_and_path  # type: ignore
   
   
   def _patched_resolve_filesystem_and_path(
       path,  # type: ignore
       filesystem=None,  # type: ignore
       *,
       memory_map=False,  # type: ignore
   ):
       """Patched version of PyArrow's _resolve_filesystem_and_path that reuses
       cached LocalFileSystem instances to avoid re-registration conflicts.
   
       We maintain two cached instances (one for memory_map=True, one for False)
       that are created upfront before any GDAL imports. This allows us to
       respect the memory_map parameter without triggering the 'file://' scheme
       registration conflict with GDAL.
       """
       global _cached_local_fs_no_mmap, _cached_local_fs_with_mmap
   
       # Original logic from pyarrow/fs.py
       if not arrow_fs._is_path_like(path):  # type: ignore
           if filesystem is not None:
               raise ValueError(
                   "'filesystem' passed but the specified path is file-like, so"
                   " there is nothing to open with 'filesystem'."
               )
           return filesystem, path
   
       if filesystem is not None:
           filesystem = arrow_fs._ensure_filesystem(  # type: ignore
               filesystem, use_mmap=memory_map
           )
           if isinstance(filesystem, LocalFileSystem):
               path = arrow_fs._stringify_path(path)  # type: ignore
           elif not isinstance(path, str):
               raise TypeError(
                   "Expected string path; path-like objects are only allowed "
                   "with a local filesystem"
               )
           path = filesystem.normalize_path(path)  # type: ignore
           return filesystem, path
   
       path = arrow_fs._stringify_path(path)  # type: ignore
   
       # PATCH: Reuse cached LocalFileSystem instead of creating a new one
       # This is the key change - line 160 in the original creates:
       # LocalFileSystem(use_mmap=memory_map)
       # We use the appropriate cached instance based on memory_map parameter
       if memory_map:
           if _cached_local_fs_with_mmap is None:
               _cached_local_fs_with_mmap = LocalFileSystem(use_mmap=True)
           filesystem = _cached_local_fs_with_mmap
       else:
           if _cached_local_fs_no_mmap is None:
               _cached_local_fs_no_mmap = LocalFileSystem(use_mmap=False)
           filesystem = _cached_local_fs_no_mmap
   
       try:
           file_info = filesystem.get_file_info(path)  # type: ignore
       except ValueError:
           file_info = None
           exists_locally = False
       else:
           exists_locally = file_info.type != arrow_fs.FileType.NotFound  # type: ignore
   
       # If file doesn't exist locally, try parsing as URI
       if not exists_locally:
           try:
               filesystem, path = FileSystem.from_uri(path)  # type: ignore
           except ValueError as e:
               msg = str(e)
               if "empty scheme" in msg or "Cannot parse URI" in msg:
                   # Neither URI nor locally existing path - will propagate
                   # file not found
                   pass
               else:
                   raise e
       else:
           path = filesystem.normalize_path(path)  # type: ignore
   
       return filesystem, path
   
   
   def apply_patch():
       """Apply the PyArrow/GDAL compatibility patch.
   
       This creates both LocalFileSystem instances (memory_map=True and False)
       upfront before any GDAL imports. This allows us to respect the memory_map
       parameter while avoiding the 'file://' scheme registration conflict.
       """
       global _cached_local_fs_no_mmap, _cached_local_fs_with_mmap
   
       # Replace the function
       arrow_fs._resolve_filesystem_and_path = _patched_resolve_filesystem_and_path
   
       # Initialize BOTH cached LocalFileSystem instances before any GDAL
       # imports. This must happen before GDAL imports to claim the 'file://'
       # scheme.
       from pyarrow import fs
   
       if _cached_local_fs_no_mmap is None:
           _cached_local_fs_no_mmap = fs.LocalFileSystem(use_mmap=False)  # type: ignore
       if _cached_local_fs_with_mmap is None:
           _cached_local_fs_with_mmap = fs.LocalFileSystem(use_mmap=True)  # type: ignore
   
   ```
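
The module keeps `_original_resolve` around "in case we need to unpatch", but no unpatch routine is shown. One way to make the replacement reversible is a small context manager that swaps an attribute and restores it on exit. This is only a sketch: `patched_attribute` is a hypothetical helper, and `fake_fs_module` stands in for `pyarrow.fs` so the example runs without pyarrow installed.

```python
import contextlib
import types


@contextlib.contextmanager
def patched_attribute(target, name, replacement):
    """Temporarily replace `target.<name>`, restoring the original on exit."""
    original = getattr(target, name)
    setattr(target, name, replacement)
    try:
        yield original
    finally:
        # Runs even if the body raises, so the patch never leaks.
        setattr(target, name, original)


# Stand-in for `pyarrow.fs` (assumption: only the attribute swap matters here).
fake_fs_module = types.SimpleNamespace(
    _resolve_filesystem_and_path=lambda path: ("original", path)
)

with patched_attribute(
    fake_fs_module,
    "_resolve_filesystem_and_path",
    lambda path: ("patched", path),
):
    inside = fake_fs_module._resolve_filesystem_and_path("data.parquet")

after = fake_fs_module._resolve_filesystem_and_path("data.parquet")
```

With the real module, the call would presumably pass `arrow_fs` and `_patched_resolve_filesystem_and_path` instead. One caveat that applies to any attribute patch of this kind: code that already ran `from some_module import name` before the patch holds its own reference to the original function and is not affected by the swap.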

