bkietz commented on code in PR #41559:
URL: https://github.com/apache/arrow/pull/41559#discussion_r1966301268


##########
cpp/src/arrow/CMakeLists.txt:
##########
@@ -876,6 +876,18 @@ if(ARROW_FILESYSTEM)
     foreach(ARROW_FILESYSTEM_TARGET ${ARROW_FILESYSTEM_TARGETS})
       target_link_libraries(${ARROW_FILESYSTEM_TARGET} PRIVATE 
${AWSSDK_LINK_LIBRARIES})
     endforeach()
+
+    if(ARROW_S3_MODULE)
+      if(NOT ARROW_BUILD_SHARED)
+        message(FATAL_ERROR "ARROW_S3_MODULE without shared libarrow is not 
supported")
+      endif()
+
+      add_library(arrow_s3fs MODULE filesystem/s3fs_module.cc 
filesystem/s3fs.cc)
+      target_link_libraries(arrow_s3fs PRIVATE ${AWSSDK_LINK_LIBRARIES} 
arrow_shared)
+      set_source_files_properties(filesystem/s3fs.cc filesystem/s3fs_module.cc
+                                  PROPERTIES SKIP_PRECOMPILE_HEADERS ON
+                                             SKIP_UNITY_BUILD_INCLUSION ON)

Review Comment:
   > Did you had any idea on how to make the module expose the missing 
functionality too?
   
   The filesystem module supports only construction of filesystems from URI 
strings (and thereafter we have access only to methods on the FileSystem base 
interface) so a brief enumeration of functionality missing relative to 
[pyarrow.fs.S3FileSystem](https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html)
 is
   - constructing the filesystem with `proxy_options` 
   - retrieving the AWS region a filesystem is connected to
   - providing a retry_strategy to use when communicating with S3
   
   It is *possible* to support all of these through URI strings alone, with 
some caveats:
   - encoding proxy options in the `s3://` URI scheme
     - since it already has a [URI 
representation](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/s3fs.h#L53)
 that can be could be added as a single URI parameter like 
`s3://u:p@mybucket/foo/bar?proxy_options=http%3A//proxy_u%3Aproxy_p@proxy_host%3A999`
     - alternatively independent parameters could be added for `proxy_scheme`, 
`proxy_username`, etc
   - AWS region is a URI parameter, so if we need to retrieve the region from a 
`FileSystem` we can convert it to a URI and read that parameter.
     - Region is resolved when the S3FileSystem is constructed, so it won't 
always be a copy of the region parameter in the original URI
     - Python has not bound 
[FileSystem::MakeUri](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/filesystem.h#L203)
 yet, but after it does access to the s3 region could be written like
       ```
       def get_s3_region(fs: FileSystem) -> str|None:
           if fs.type_name != "s3":
               return None
       
           from urllib.parse import urlparse, parse_qs
           params = parse_qs(urlparse(fs.to_uri()).query)
           return params["region"][0]
       ```
   - `S3RetryStrategy` is currently available in C++ as a [fully 
user-extensible 
interface](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/cpp/src/arrow/filesystem/s3fs.h#L72),
 which is of course not specifiable in a URI parameter. However in python at 
least this is only exposed as the choice between standard or default strategies 
and a [maximum number of 
retries](https://github.com/apache/arrow/blob/12cdaaafb7a5ba39db81ba6d7e565b8e7a24117e/python/pyarrow/_s3fs.pyx#L120).
 This more limited enumeration of retry strategies *can* be specified in a 
parameter, like `s3://u:p@mybucket/foo/bar?retry_strategy=standard/3`
     - Full configuration of the retry strategy interface seems like a feature 
which might be little used but a survey on the ML would be a good idea. If no 
significant use cases arise, there can be a deprecation cycle for public 
S3RetryStrategy
   
   Once the above changes are made, the full functionality of 
`pyarrow.fs.S3FileSystem` can be recovered using just `libarrow_s3fs`. At that 
point, the option to build `libarrow` directly linked to the AWS SDK will be 
redundant and can be removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to