rdettai edited a comment on pull request #1072:
URL: https://github.com/apache/arrow-datafusion/pull/1072#issuecomment-937593685


   I opened this PR to formalize the discussion, sorry if it brought in more 
confusion than clarity 😃. Let me try to summarize the full context.
   
   - The challenge comes from the fact that we would want to be able to plug in 
arbitrary object stores (possibly from external repositories as discussed in 
#907)
   - **For a local user of the Datafusion lib**, this can be done by attaching 
the `ObjectStoreRegistry` to the `ExecutionContext`, and then registering any 
`ObjectStore` implementation to the context.
   - **For a distributed user of Datafusion (like ballista)**, things are a bit 
more complicated: 
      - (1) either we keep sourcing the `ObjectStore` instances from an 
`ObjectStoreRegistry`. The registry will need to be initialized coherently in 
each component of the application (in Ballista that means the executor and the 
scheduler), and the right plumbing needs to be added to have that registry 
instance readily available in each part where an `ObjectStore` instance is 
required. This is particularly annoying for the serde components which are not 
attached to the `ExecutionContext` in any way. A static (if possible immutable) 
instance of `ObjectStoreRegistry` helps solving these issues: coherent registry 
+ simplified plumbing.
      - (2) either we define at the serde level what object stores are 
available. Just as with the `TableProvider` and `ExecutionPlan` traits, where 
the Ballista serialization only recognizes a given list of implementations 
(namely CSV and Parquet), the scheduler / executor will instantiate object 
stores directly without the `ObjectStoreRegistry` (the serde calls 
`ObjectStoreFoo::new` directly). This mean we won't allow to add new object 
stores through the context in Ballista.
   
   If I understand correctly, @houqp is leaning toward solution (2). Even 
though it lacks flexibility, it uses the same mechanism Ballista is already 
using for table providers, so I would also go with it.
   
   So with solution (2) the Ballista flow would be:
   - the object store is resolved from the `ObjectStoreRegistry` in the 
Datafusion context on the **Ballista client**
   - we try to serialize the logical plan
     - if the object store is not known by the serde: error
     - if it is known, the object store is serialized into a simple `enum` (we 
don't serialize further config such as credentials)
   - the deserialization calls `ObjectStoreFoo::new` directly, which might pick 
up configuration from its environment
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to