> And probably some more I don't think of currently. I think this is useful
> work as it also would enable other "extensions" to work in a similar way
I 100% agree

On Wed, Jun 9, 2021 at 2:30 PM Daniël Heres <danielhe...@gmail.com> wrote:

> Thanks all for the valuable input!
>
> I agree following the plugin model makes a lot of sense for now (either
> in the arrow-datafusion repo or somewhere external, for example in delta-rs
> if we're OK with it not being part of Apache right now).
>
> In order to support certain Delta Lake features including SQL syntax we
> probably need to make DataFusion a bit more extensible besides what is
> currently possible with the TableProvider, for example:
>
> * Allow registering a custom data format (for supporting things like
>   *create external table t stored as parquet*)
> * Allow parsing and/or handling custom SQL syntax like *optimize* /
>   *vacuum* / *select * from t version as of n*, etc.
>
> And probably some more I don't think of currently. I think this is useful
> work as it also would enable other "extensions" to work in a similar way
> (e.g. Apache Iceberg and other formats / readers / writers / syntax) and
> make DataFusion a more flexible engine.
>
> Best, Daniël
>
> On Wed, 9 Jun 2021 at 20:07, Neville Dipale <nevilled...@gmail.com> wrote:
>
> > The correct approach might be to improve DataFusion support in
> > delta-rs. TableProvider is already implemented here:
> > https://github.com/delta-io/delta-rs/blob/main/rust/src/delta_datafusion.rs
> >
> > I've pinged QP to ask for their advice.
> >
> > Neville
> >
> > On Wed, 9 Jun 2021 at 19:58, Andrew Lamb <al...@influxdata.com> wrote:
> >
> > > I think the idea of DataFusion + DeltaLake is quite compelling and
> > > likely useful.
> > >
> > > However, I think DataFusion is ideally an "embeddable query engine"
> > > rather than a database system in itself, so in that mental model Delta
> > > Lake integration belongs somewhere other than the core DataFusion crate.
> > >
> > > My ideal structure would be a new crate (maybe not even part of the
> > > Apache Arrow Project), perhaps called `datafusion-delta-rs`, that
> > > contained the TableProvider and whatever else was needed to integrate
> > > DataFusion with DeltaLake
> > >
> > > This structure could also start a pattern of publishing plugins for
> > > DataFusion separately from the core.
> > >
> > > Andrew
> > >
> > > p.s. now that Arrow is publishing more incrementally (e.g. 4.1.0, 4.2.0,
> > > etc), I think delta-rs[1] and datafusion both only specify `4.x` so they
> > > should work together nicely
> > >
> > > https://github.com/delta-io/delta-rs/blame/main/rust/Cargo.toml
> > >
> > > On Wed, Jun 9, 2021 at 2:29 AM Daniël Heres <danielhe...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I would like to receive some feedback about adding Delta Lake support
> > > > to DataFusion (https://github.com/apache/arrow-datafusion/issues/525).
> > > > As you might know, Delta Lake <https://delta.io/> is a format adding
> > > > features like ACID transactions, statistics, and storage optimization
> > > > to Parquet and is getting quite some traction for managing data lakes.
> > > > It seems a great feature to have in DataFusion as well.
> > > >
> > > > The delta-rs <https://github.com/delta-io/delta-rs> project provides a
> > > > native, Apache licensed, Rust implementation of Delta Lake, already
> > > > supporting a large part of the format and operations.
> > > >
> > > > The first integration I would like to propose is adding read support
> > > > via a new TableProvider. There might be some work to do around
> > > > dependencies as both DataFusion and delta-rs rely on (certain versions
> > > > of) Arrow and Parquet.
> > > >
> > > > Let me know if you have any further ideas or concerns.
> > > >
> > > > Best regards,
> > > >
> > > > Daniël Heres
>
> --
> Daniël Heres