[GitHub] [arrow-datafusion] yjshen edited a comment on pull request #811: Add support for reading remote storage systems

GitBox Thu, 19 Aug 2021 00:34:41 -0700


yjshen edited a comment on pull request #811:
URL: https://github.com/apache/arrow-datafusion/pull/811#issuecomment-901679869



   @alamb @andygrove  @Dandandan @jorgecarleitao @rdettai  While making the 
remote storage system's object listing & data reading API async, I ran into a 
design choice. It might be quite important, and I'd love to have your 
suggestions:
   
   ### To which level should I propagate async?  
   
   The question arises because async propagates upward: once we have async dir 
listing, we can have async logical plans & an async table provider, and then 
an async DataFrame / context API.
   
   Two available alternatives are:
   
   1. Limit async to just `listing` / `metadata_fetch` /  file `read`, wrap a 
sync version over these async functions, and keep most of the user-facing API 
untouched (keeping the PR as lean as possible).
   2. Propagate the async API all the way up and finally change the 
user-facing API, including DataFrame & ExecutionContext (which means huge 
user-facing API changes).
   
   Currently, this PR takes the first approach: it constructs all APIs in 
`ObjectStore` / `ObjectReader` /  `SourceRootDescriptor` natively in async and 
wraps each async function in a sync one, trying to keep other parts of the 
project untouched. Many thanks to @houqp for guiding me along the way.
   
   Does approach 1 make sense to you? 
   
   ### If I take approach 1, how should the sync versions be constructed?
   
   This PR makes each sync function a wrapper over its async counterpart, so 
there is a single implementation of each piece of functionality. It therefore 
relies on `futures::executor::block_on` to bridge from async to sync. 
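   
   To make the wrapper pattern concrete, here is a minimal sketch using only 
the standard library. The `block_on` below is a simplified stand-in for 
`futures::executor::block_on`, not that function itself, and `InMemoryStore`, 
`list_async` and `list` are hypothetical names, not the PR's actual API:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Std-only stand-in for `futures::executor::block_on`: polls the future on
/// the calling thread and parks that thread while the future is pending.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical in-memory store, used only for illustration.
struct InMemoryStore {
    paths: Vec<String>,
}

impl InMemoryStore {
    /// The single native implementation is async.
    async fn list_async(&self, prefix: &str) -> Vec<String> {
        self.paths
            .iter()
            .filter(|p| p.starts_with(prefix))
            .cloned()
            .collect()
    }

    /// The sync version only wraps the async one, so the logic lives in
    /// one place.
    fn list(&self, prefix: &str) -> Vec<String> {
        block_on(self.list_async(prefix))
    }
}
```

   Callers that stay sync would use `store.list("a/")`, while async callers 
could await `store.list_async("a/")` directly.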
   
   However, this approach is flawed: `block_on` may block the only thread in 
tokio, so the future inside never gets a chance to run and the call hangs 
forever if the tokio runtime is not a multi-threaded one. (I temporarily 
changed the related test to use `#[tokio::test(flavor = "multi_thread", 
worker_threads = 2)]` to avoid hanging.) Do you have any suggestions on this?
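   
   The hang can be modelled without tokio. The sketch below is a toy, not 
tokio's scheduler: a plain job queue drained on one thread stands in for a 
single-threaded runtime, and a 50 ms timeout stands in for "hangs forever" so 
that the example terminates. The function name is hypothetical:

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::Duration;

/// Runs two jobs in order on one thread, like a single-threaded runtime.
/// Returns true if the blocking job gave up before its value was produced.
fn blocking_job_starves_producer() -> bool {
    let (value_tx, value_rx) = mpsc::channel::<u32>();
    let (outcome_tx, outcome_rx) = mpsc::channel();
    let mut queue: VecDeque<Box<dyn FnOnce()>> = VecDeque::new();

    // Job 1 plays the sync wrapper: it blocks the only thread while
    // waiting for a value that another job is supposed to produce.
    queue.push_back(Box::new(move || {
        let outcome = value_rx.recv_timeout(Duration::from_millis(50));
        outcome_tx.send(outcome).unwrap();
    }));

    // Job 2 plays the work that would produce the value. It cannot start
    // until job 1 returns, because both share the same thread.
    queue.push_back(Box::new(move || {
        // By now the receiver is gone, so the send result is ignored.
        let _ = value_tx.send(42);
    }));

    while let Some(job) = queue.pop_front() {
        job();
    }

    matches!(outcome_rx.recv().unwrap(), Err(RecvTimeoutError::Timeout))
}
```

   With a second worker thread available to run the producing job, the 
blocked job could be unblocked, which is why the multi-threaded test flavor 
avoids the hang.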
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
