pitrou commented on code in PR #45089:
URL: https://github.com/apache/arrow/pull/45089#discussion_r2177617535
##########
docs/source/python/filesystems.rst:
##########
@@ -388,6 +388,32 @@ Then all the functionalities of :class:`FileSystem` are accessible::
ds.dataset("data/", filesystem=pa_fs)
+Using fsspec-compatible filesystem URIs
+---------------------------------------
+
+PyArrow can automatically instantiate fsspec filesystems by prefixing the URI
+scheme with ``fsspec+``. This allows you to use the fsspec-compatible
+filesystems directly with PyArrow's IO functions without needing to manually
+create a filesystem object. Example writing and reading a Parquet file
+using an in-memory filesystem provided by `fsspec`_::
+
+ import pyarrow as pa
+ import pyarrow.parquet as pq
+
+ table = pa.table({'a': [1, 2, 3]})
+ pq.write_table(table, "fsspec+memory://path/to/my_table.parquet")
+ pq.read_table("fsspec+memory://path/to/my_table.parquet")
+
+Example reading parquet file from GitHub directly::
+
+ pq.read_table("fsspec+github://apache:arrow-testing@/data/parquet/alltypes-java.parquet")
+
+Hugging Face's sceheme explicitly allowed as a shortcut without needing to prefix
+with ``fsspec+``. This is useful for reading datasets hosted on Hugging Face::
Review Comment:
```suggestion
Hugging Face URIs are explicitly allowed as a shortcut without needing to prefix
with ``fsspec+``. This is useful for reading datasets hosted on Hugging Face::
```
##########
python/pyarrow/_fs.pyx:
##########
@@ -436,12 +468,19 @@ cdef class FileSystem(_Weakrefable):
----------
uri : string
URI-based path, for example: file:///some/local/path.
+ treat_path_as_prefix : bool, default False
+ If True, the path component of the URI is treated as a prefix
+ inside the FileSystem instance. This means that all operations
+ will be relative to this prefix, and the prefix must point to a
+ directory. If False, the path component is treated as an abstract
+ path inside the FileSystem instance.
Returns
-------
- tuple of (FileSystem, str path)
- With (filesystem, path) tuple where path is the abstract path
- inside the FileSystem instance.
+ tuple of (FileSystem, str path) or FileSystem
Review Comment:
It's not very nice for the return type to depend on argument values if we
ever want to add type hints for this.
We may keep it like this, or perhaps you'd rather expose two different
methods?
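To illustrate the typing concern, here is a minimal sketch of the two options. `FakeFileSystem` and `from_uri_with_prefix` are hypothetical names used only for illustration; they are not part of PyArrow, and the URI parsing is deliberately simplistic.

```python
from typing import Literal, Tuple, overload


class FakeFileSystem:
    """Hypothetical stand-in for FileSystem (illustration only)."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    # Option A: a single method whose return type depends on the flag.
    # Type hints then need one overload per flag value.
    @overload
    @staticmethod
    def from_uri(
        uri: str, *, treat_path_as_prefix: Literal[False] = ...
    ) -> Tuple["FakeFileSystem", str]: ...

    @overload
    @staticmethod
    def from_uri(
        uri: str, *, treat_path_as_prefix: Literal[True]
    ) -> "FakeFileSystem": ...

    @staticmethod
    def from_uri(uri, *, treat_path_as_prefix=False):
        # Naive split, assumed for the sketch; real URI handling is richer.
        path = uri.split("://", 1)[1]
        if treat_path_as_prefix:
            return FakeFileSystem(prefix=path)
        return FakeFileSystem(), path

    # Option B: a second method (hypothetical name) with a single,
    # unconditional return type.
    @staticmethod
    def from_uri_with_prefix(uri: str) -> "FakeFileSystem":
        return FakeFileSystem(prefix=uri.split("://", 1)[1])


fs, path = FakeFileSystem.from_uri("memory://a/b")
prefixed = FakeFileSystem.from_uri("memory://a/b", treat_path_as_prefix=True)
also_prefixed = FakeFileSystem.from_uri_with_prefix("memory://a/b")
```

Option A keeps one entry point but pushes the complexity into the annotations; Option B needs no overloads at the cost of a larger API surface.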
##########
python/pyarrow/_fs.pyx:
##########
@@ -424,7 +424,39 @@ cdef class FileSystem(_Weakrefable):
return fs
@staticmethod
- def from_uri(uri):
+ def _fsspec_from_uri(uri):
+ """Instantiate FSSpecHandler and path for the given URI."""
+ try:
+ import fsspec
+ except ImportError:
+ raise ImportError(
+ "`fsspec` is required to handle `fsspec+<filesystem>://` and `hf://` URIs."
+ )
+ from .fs import FSSpecHandler
+
+ uri = uri.removeprefix("fsspec+")
+ fs, path = fsspec.url_to_fs(uri)
+ fs = PyFileSystem(FSSpecHandler(fs))
+
+ return fs, path
+
+ @staticmethod
+ def _native_from_uri(uri):
+ """Instantiate native FileSystem and path for the given URI."""
+ cdef:
+ c_string c_path
+ c_string c_uri
+ CResult[shared_ptr[CFileSystem]] result
+
+ if isinstance(uri, pathlib.Path):
+ # Make absolute
+ uri = uri.resolve().absolute()
+ c_uri = tobytes(_stringify_path(uri))
+ with nogil:
+ result = CFileSystemFromUriOrPath(c_uri, &c_path)
+ return FileSystem.wrap(GetResultValue(result)), frombytes(c_path)
+
+ def from_uri(uri, treat_path_as_prefix=False):
Review Comment:
Can we make the argument kw-only?
```suggestion
def from_uri(uri, *, treat_path_as_prefix=False):
```
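For context, a small sketch of what the bare `*` changes for callers. The function below is a hypothetical stand-in that only mirrors the suggested signature; it is not the PyArrow implementation.

```python
def from_uri(uri, *, treat_path_as_prefix=False):
    """Hypothetical stand-in; only the signature matters here."""
    return uri, treat_path_as_prefix


# Passing the flag by keyword works as before.
by_keyword = from_uri("file:///some/local/path", treat_path_as_prefix=True)

# Passing it positionally is rejected, so call sites stay self-describing.
try:
    from_uri("file:///some/local/path", True)
except TypeError as exc:
    positional_error = exc
else:
    positional_error = None
```

This also leaves room to add or reorder optional parameters later without breaking positional callers.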
--
This is an automated message from the Apache Git Service.