bkietz commented on code in PR #41559: URL: https://github.com/apache/arrow/pull/41559#discussion_r1966301268
########## cpp/src/arrow/CMakeLists.txt: ########## @@ -876,6 +876,18 @@ if(ARROW_FILESYSTEM) foreach(ARROW_FILESYSTEM_TARGET ${ARROW_FILESYSTEM_TARGETS}) target_link_libraries(${ARROW_FILESYSTEM_TARGET} PRIVATE ${AWSSDK_LINK_LIBRARIES}) endforeach() + + if(ARROW_S3_MODULE) + if(NOT ARROW_BUILD_SHARED) + message(FATAL_ERROR "ARROW_S3_MODULE without shared libarrow is not supported") + endif() + + add_library(arrow_s3fs MODULE filesystem/s3fs_module.cc filesystem/s3fs.cc) + target_link_libraries(arrow_s3fs PRIVATE ${AWSSDK_LINK_LIBRARIES} arrow_shared) + set_source_files_properties(filesystem/s3fs.cc filesystem/s3fs_module.cc + PROPERTIES SKIP_PRECOMPILE_HEADERS ON + SKIP_UNITY_BUILD_INCLUSION ON) Review Comment: > Did you had any idea on how to make the module expose the missing functionality too? The filesystem module supports only construction of filesystems from URI strings (and thereafter we have access only to methods on the FileSystem base interface) so a brief enumeration of functionality missing relative to [pyarrow.fs.S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html) is - constructing the filesystem with `proxy_options` - retrieving the AWS region a filesystem is connected to - providing a retry_strategy to use when communicating with S3 It is *possible* to support all of these through URI strings alone, with some caveats: - encoding proxy options in the `s3://` URI scheme - since it already has a [URI representation](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/s3fs.h#L53) that can be could be added as a single URI parameter like `s3://u:p@mybucket/foo/bar?proxy_options=http%3A//proxy_u%3Aproxy_p@proxy_host%3A999` - alternatively independent parameters could be added for `proxy_scheme`, `proxy_username`, etc - AWS region is a URI parameter, so if we need to retrieve the region from a `FileSystem` we can convert it to a URI and read that parameter. - Region is resolved when the S3FileSystem is constructed, so it won't always be a copy of the region parameter in the original URI - Python has not bound [FileSystem::MakeUri](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/filesystem.h#L203) yet, but after it does access to the s3 region could be written like ``` def get_s3_region(fs: FileSystem) -> str|None: if fs.type_name != "s3": return None from urllib.parse import urlparse, parse_qs params = parse_qs(urlparse(fs.to_uri()).query) return params["region"][0] ``` - `S3RetryStrategy` is currently available in C++ as a [fully user-extensible interface](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/s3fs.h#L72), which is of course not specifiable in a URI parameter. However in python at least this is only exposed as the choice between standard or default strategies and a [maximum number of retries](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/python/pyarrow/_s3fs.pyx#L120). This more limited enumeration of retry strategies *can* be specified in a parameter, like `s3://u:p@mybucket/foo/bar?retry_strategy=standard/3` - Full configuration of the retry strategy interface seems like a feature which might be little used but a survey on the ML would be a good idea. If no significant use cases arise, there can be a deprecation cycle for public S3RetryStrategy Once the above changes are made, the full functionality of `pyarrow.fs.S3FileSystem` can be recovered using just `libarrow_s3fs`. At that point, the option to build `libarrow` directly linked to the AWS SDK will be redundant and can be removed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org