bkietz commented on code in PR #41559:
URL: https://github.com/apache/arrow/pull/41559#discussion_r1966301268
##########
cpp/src/arrow/CMakeLists.txt:
##########
@@ -876,6 +876,18 @@ if(ARROW_FILESYSTEM)
foreach(ARROW_FILESYSTEM_TARGET ${ARROW_FILESYSTEM_TARGETS})
target_link_libraries(${ARROW_FILESYSTEM_TARGET} PRIVATE
${AWSSDK_LINK_LIBRARIES})
endforeach()
+
+ if(ARROW_S3_MODULE)
+ if(NOT ARROW_BUILD_SHARED)
+ message(FATAL_ERROR "ARROW_S3_MODULE without shared libarrow is not
supported")
+ endif()
+
+ add_library(arrow_s3fs MODULE filesystem/s3fs_module.cc
filesystem/s3fs.cc)
+ target_link_libraries(arrow_s3fs PRIVATE ${AWSSDK_LINK_LIBRARIES}
arrow_shared)
+ set_source_files_properties(filesystem/s3fs.cc filesystem/s3fs_module.cc
+ PROPERTIES SKIP_PRECOMPILE_HEADERS ON
+ SKIP_UNITY_BUILD_INCLUSION ON)
Review Comment:
> Did you had any idea on how to make the module expose the missing
functionality too?
The filesystem module supports only construction of filesystems from URI
strings (and thereafter we have access only to methods on the FileSystem base
interface) so a brief enumeration of functionality missing relative to
[pyarrow.fs.S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html)
is
- constructing the filesystem with `proxy_options`
- retrieving the AWS region a filesystem is connected to
- providing a retry_strategy to use when communicating with S3
It is *possible* to support all of these through URI strings alone, with
some caveats:
- encoding proxy options in the `s3://` URI scheme
- since it already has a [URI
representation](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/s3fs.h#L53)
that can be could be added as a single URI parameter like
`s3://u:p@mybucket/foo/bar?proxy_options=http%3A//proxy_u%3Aproxy_p@proxy_host%3A999`
- alternatively independent parameters could be added for `proxy_scheme`,
`proxy_username`, etc
- AWS region is a URI parameter, so if we need to retrieve the region from a
`FileSystem` we can convert it to a URI and read that parameter.
- Region is resolved when the S3FileSystem is constructed, so it won't
always be a copy of the region parameter in the original URI
- Python has not bound
[FileSystem::MakeUri](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/filesystem.h#L203)
yet, but after it does access to the s3 region could be written like
```
def get_s3_region(fs: FileSystem) -> str|None:
if fs.type_name != "s3":
return None
from urllib.parse import urlparse, parse_qs
params = parse_qs(urlparse(fs.to_uri()).query)
return params["region"][0]
```
- `S3RetryStrategy` is currently available in C++ as a [fully
user-extensible
interface](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/s3fs.h#L72),
which is of course not specifiable in a URI parameter. However in python at
least this is only exposed as the choice between standard or default strategies
and a [maximum number of
retries](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/python/pyarrow/_s3fs.pyx#L120).
This more limited enumeration of retry strategies *can* be specified in a
parameter, like `s3://u:p@mybucket/foo/bar?retry_strategy=standard/3`
- Full configuration of the retry strategy interface seems like a feature
which might be little used but a survey on the ML would be a good idea. If no
significant use cases arise, there can be a deprecation cycle for public
S3RetryStrategy
Once the above changes are made, the full functionality of
`pyarrow.fs.S3FileSystem` can be recovered using just `libarrow_s3fs`. At that
point, the option to build `libarrow` directly linked to the AWS SDK will be
redundant and can be removed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]