rdettai edited a comment on pull request #1072:
URL: https://github.com/apache/arrow-datafusion/pull/1072#issuecomment-937593685
I opened this PR to formalize the discussion, sorry if it brought in more
confusion than clarity 😃. Let me try to summarize the full context.
- The challenge comes from the fact that we would want to be able to plug in
arbitrary object stores (possibly from external repositories as discussed in
#907)
- **For a local user of the Datafusion lib**, this can be done by attaching
the `ObjectStoreRegistry` to the `ExecutionContext`, and then registering any
`ObjectStore` implementation to the context.
- **For a distributed user of Datafusion (like ballista)**, things are a bit
more complicated:
- (1) either we keep sourcing the `ObjectStore` instances from an
`ObjectStoreRegistry`. The registry will need to be initialized coherently in
each component of the application (in Ballista that means the executor and the
scheduler), and the right plumbing needs to be added to have that registry
instance readily available in each part where an `ObjectStore` instance is
required. This is particularly annoying for the serde components which are not
attached to the `ExecutionContext` in any way. A static (if possible immutable)
instance of `ObjectStoreRegistry` helps solving these issues: coherent registry
+ simplified plumbing.
- (2) or we define at the serde level what object stores are available.
Just as with the `TableProvider` and `ExecutionPlan` traits, where the Ballista
serialization only recognizes a given list of implementations (namely CSV and
Parquet), the scheduler / executor will instantiate object stores directly
without the `ObjectStoreRegistry` (the serde calls `ObjectStoreFoo::new`
directly). This mean we won't allow to add new object stores through the
context in Ballista.
If I understand correctly, @houqp is leaning toward solution (2). Even
though it lacks flexibility, it uses the same mechanism Ballista is already
using for table providers, so I would also go with it.
So with solution (2) the Ballista flow would be:
- the object store is resolved from the `ObjectStoreRegistry` in the
Datafusion context on the **Ballista client**
- we try to serialize the logical plan
- if the object store is not known by the serde: error
- if it is known, the object store is serialized into a simple `enum` (we
don't serialize further config such as credentials)
- the deserialization calls `ObjectStoreFoo::new` directly, which might pick
up configuration from its environment
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]