I haven't had a chance yet to read this thread in detail, but I am definitely very interested in this conversation and will make time this week to add my thoughts.
Thanks, Andy.

On Sun, Sep 27, 2020, 4:01 PM Adam Lippai <a...@rigo.sk> wrote:

> Hi Neville,
>
> Yes, my concern with common row-based DB APIs is that I use Arrow/Parquet for OLAP too. What https://turbodbc.readthedocs.io/en/latest/ (Python) or https://github.com/pacman82/odbc-api#state (Rust) does is read large blocks of data instead of processing rows one by one, but indeed, ODBC and the PostgreSQL wire protocol are still row-based.
>
> Clickhouse is an interesting example, as it directly supports Arrow and Parquet *server-side* (I didn't try it yet, just read it in the docs).
>
> Best regards,
> Adam Lippai
>
> On Sun, Sep 27, 2020 at 11:24 PM Neville Dipale <nevilled...@gmail.com> wrote:
>
> > Thanks for the feedback.
> >
> > My interest is mainly in the narrow use case of reading and writing batch data, so I wouldn't want to deal with producing and consuming rows per se. Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the row-based or OLTP case, and I'm considering something more suitable for the OLAP case.
> >
> > @Wes I'll have a read through the Python DB API; I've also been looking at JDBC, as well as how Apache Spark manages to get such good performance from JDBC.
> >
> > I haven't been an ODBC fan, but mainly because of historic struggles with getting it to work on Linux envs where I don't have system control. With that said, we could still support ODBC.
> >
> > @Jorge, I have an implementation at rust-dataframe (https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres) which uses rust-postgres. I however don't use the row-based API, as that comes at a serialization cost (going from bytes -> Rust types -> Arrow). I instead use the Postgres binary format (https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204).
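[Editor's note: a minimal sketch of the binary-format approach Neville describes above. PostgreSQL's `COPY ... TO STDOUT (FORMAT binary)` stream starts with a fixed 11-byte signature, a 4-byte flags field, and a 4-byte header-extension length; tuples then carry a 16-bit field count and length-prefixed fields. The function names and the idea of hand-parsing the header here are illustrative, not taken from the rust-dataframe code; a real reader would stream these bytes from rust-postgres and build Arrow arrays column by column.]

```rust
/// The 11-byte signature that starts every PostgreSQL binary COPY stream.
const SIGNATURE: &[u8; 11] = b"PGCOPY\n\xff\r\n\0";

/// Validates the COPY header and returns the offset where tuple data begins.
fn parse_copy_header(buf: &[u8]) -> Result<usize, String> {
    if buf.len() < 19 || &buf[..11] != &SIGNATURE[..] {
        return Err("not a binary COPY stream".to_string());
    }
    // 4-byte flags field, then 4-byte header-extension length (big-endian).
    let ext_len = u32::from_be_bytes([buf[15], buf[16], buf[17], buf[18]]) as usize;
    Ok(19 + ext_len)
}

/// Reads the 16-bit field count of the tuple starting at `offset`
/// (-1, i.e. 0xFFFF, marks the end-of-data trailer).
fn tuple_field_count(buf: &[u8], offset: usize) -> i16 {
    i16::from_be_bytes([buf[offset], buf[offset + 1]])
}

fn main() {
    // A hand-built stream: header, one tuple with a single 4-byte field, trailer.
    let mut stream = Vec::new();
    stream.extend_from_slice(SIGNATURE);
    stream.extend_from_slice(&0u32.to_be_bytes()); // flags
    stream.extend_from_slice(&0u32.to_be_bytes()); // no header extension
    stream.extend_from_slice(&1i16.to_be_bytes()); // 1 field in this tuple
    stream.extend_from_slice(&4i32.to_be_bytes()); // field length
    stream.extend_from_slice(&42i32.to_be_bytes()); // field value
    stream.extend_from_slice(&(-1i16).to_be_bytes()); // trailer

    let data_start = parse_copy_header(&stream).unwrap();
    assert_eq!(data_start, 19);
    assert_eq!(tuple_field_count(&stream, data_start), 1);
    println!("tuple data starts at byte {}", data_start);
}
```

The point of going through this format rather than a driver's `Row` API is that field bytes can be copied straight into Arrow buffers without an intermediate Rust value per cell.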
> > That postgres module would be the starting point of such a separate crate.
> >
> > For Postgres <> Arrow type conversions, I leverage two methods:
> >
> > 1. When reading a table, I get the schema from the *information_schema* system table.
> > 2. When reading a query, I issue the query with a 1-row limit, and convert the row's schema to an Arrow schema.
> >
> > @Adam I think async and pooling would be attainable, yes. If an underlying SQL crate uses r2d2 for pooling, an API that supports that could be provided.
> >
> > In summary, I'm thinking along the lines of:
> >
> > * A reader that takes connection parameters & a query or table
> > * The reader can handle partitioning if need be (similar to how Spark does it)
> > * The reader returns a Schema, and can be iterated on to return data in batches
> >
> > * A writer that takes connection parameters and a table
> > * The writer writes batches to a table, and is able to write batches in parallel
> >
> > In the case of a hypothetical interfacing with columnar databases like Clickhouse, we would be able to leverage materialising Arrow arrays from columns, instead of the potential column-wise conversions that can be performed from row-based APIs.
> >
> > Neville
> >
> > On Sun, 27 Sep 2020 at 22:08, Adam Lippai <a...@rigo.sk> wrote:
> >
> > > One more universal approach is to use ODBC; this is a recent Rust conversation (with example) on the topic: https://github.com/Koka/odbc-rs/issues/140
> > >
> > > Honestly I find the Python DB API too simple; all it provides is a row-by-row API. I miss four things:
> > >
> > > - Batched or bulk processing, both for data loading and dumping.
> > > - Async support (Python has asyncio and async web frameworks, but no async DB spec).
> > >   SQLAlchemy async support is coming soon, and there is https://github.com/encode/databases
> > > - Connection pooling (it's common to use TLS; connection reuse would be nice as TLS 1.3 is not here yet).
> > > - Failover / load balancing support (this is connected to the previous).
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >
> > > > That would be awesome! I agree with this, and it would be really useful, as it would leverage all the goodies that RDBMSs have wrt transactions, etc.
> > > >
> > > > I would probably go for having database-specifics outside of the Arrow project, so that they can be used by other folks beyond Arrow, and keep the Arrow-specifics (i.e. conversion from the format of the specific databases to Arrow) as part of the arrow crate. Ideally, as Wes wrote, with some standard to make it easier to handle different DBs.
> > > >
> > > > I think that there are two layers: one is how to connect to a database, the other is how to serialize/deserialize. AFAIK PEP 249 covers both layers, as it standardizes things like `connect` and `tpc_begin`, as well as how things should be serialized to Python objects (e.g. dates should be datetime.date). This split is done by postgres for Rust <https://github.com/sfackler/rust-postgres>, as it offers 5 crates:
> > > > * postgres-async
> > > > * postgres-sync (a blocking wrapper of postgres-async)
> > > > * postgres-types (to convert to native Rust <---- IMO this one is what we want to offer in Arrow)
> > > > * postgres-TLS
> > > > * postgres-openssl
> > > >
> > > > `postgres-sync` implements Iterator<Row> (`client.query`), and postgres-async implements Stream<Row>.
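[Editor's note: a sketch of the serialization layer Jorge isolates above. The trait and impl below are hypothetical, not the actual postgres-types API; they only illustrate the shape of a "wire bytes to native value" layer. Postgres sends an int4 as 4 big-endian bytes, so a minimal version looks like this; an Arrow-focused crate would offer the same conversion straight into Arrow arrays instead of scalars.]

```rust
/// Hypothetical conversion trait: turn a column's raw wire bytes into a
/// native value (the role postgres-types plays for rust-postgres).
trait FromSqlBytes: Sized {
    fn from_sql_bytes(raw: &[u8]) -> Result<Self, String>;
}

impl FromSqlBytes for i32 {
    // Postgres int4: exactly 4 bytes, network (big-endian) byte order.
    fn from_sql_bytes(raw: &[u8]) -> Result<Self, String> {
        let bytes: [u8; 4] = raw
            .try_into()
            .map_err(|_| "expected 4 bytes for int4".to_string())?;
        Ok(i32::from_be_bytes(bytes))
    }
}

fn main() {
    assert_eq!(i32::from_sql_bytes(&[0, 0, 0, 42]), Ok(42));
    assert!(i32::from_sql_bytes(&[0, 0]).is_err());
}
```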
> > > > One idea is to have a generic<T> iterator/stream adapter that yields RecordBatches. The implementation of this trait by different providers would give support for it to be used in Arrow and DataFusion.
> > > >
> > > > Besides postgres, one idea is to pick the top from this list <https://db-engines.com/en/ranking>:
> > > >
> > > > * Oracle
> > > > * MySQL
> > > > * MsSQL
> > > >
> > > > Another idea is to start by supporting SQLite, which is a good development env to work with relational databases.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com> wrote:
> > > >
> > > > > Hi Arrow developers,
> > > > >
> > > > > I would like to gauge the appetite for an Arrow SQL connector that:
> > > > >
> > > > > * Reads and writes Arrow data to and from SQL databases
> > > > > * Reads tables and queries into record batches, and writes batches to tables (either append or overwrite)
> > > > > * Leverages binary SQL formats where available (e.g. the PostgreSQL format is relatively easy and well-documented)
> > > > > * Provides a batch interface that abstracts away the different database semantics, and exposes a RecordBatchReader (https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html), and perhaps a RecordBatchWriter
> > > > > * Resides in the Rust repo as either an arrow::sql module (like arrow::csv, arrow::json, arrow::ipc) or alternatively is a separate crate in the workspace (*arrow-sql*?)
> > > > >
> > > > > I would be able to contribute a Postgres reader/writer as a start.
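[Editor's note: a minimal sketch of the generic adapter Jorge proposes, batching any driver's `Iterator<Row>` into record batches. `Row` and `RecordBatch` here are hypothetical stand-ins for the arrow crate's types, used so the example is self-contained; a real implementation would build typed Arrow arrays per column and also expose the schema up front, as in Neville's reader outline.]

```rust
/// Stand-in for a driver row: one i64 value per column.
type Row = Vec<i64>;

/// Stand-in for arrow::record_batch::RecordBatch.
#[derive(Debug, PartialEq)]
struct RecordBatch {
    rows: Vec<Row>,
}

/// Adapts any row iterator into an iterator of batches of up to `batch_size`
/// rows, so row-based drivers can feed batch-oriented consumers.
struct BatchReader<I: Iterator<Item = Row>> {
    rows: I,
    batch_size: usize,
}

impl<I: Iterator<Item = Row>> Iterator for BatchReader<I> {
    type Item = RecordBatch;

    fn next(&mut self) -> Option<RecordBatch> {
        let rows: Vec<Row> = self.rows.by_ref().take(self.batch_size).collect();
        if rows.is_empty() {
            None
        } else {
            Some(RecordBatch { rows })
        }
    }
}

fn main() {
    // Five 1-column rows in batches of two -> batches of 2, 2 and 1 rows.
    let rows = (0..5i64).map(|i| vec![i]);
    let reader = BatchReader { rows, batch_size: 2 };
    let sizes: Vec<usize> = reader.map(|b| b.rows.len()).collect();
    assert_eq!(sizes, vec![2, 2, 1]);
    println!("{:?}", sizes);
}
```

An async variant would implement `Stream<Item = RecordBatch>` over a driver's `Stream<Row>` in the same way.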
> > > > > I could make this a separate crate, but to drive adoption I would prefer this living in Arrow; it can also remain updated (sometimes we reorganise modules and end up breaking dependencies).
> > > > >
> > > > > Also, being developed next to DataFusion could allow DF to support SQL databases, as this would be yet another datasource.
> > > > >
> > > > > Some questions:
> > > > > * Should such a library support async, sync or both IO methods?
> > > > > * Other than Postgres, what other databases would be interesting? Here I'm hoping that once we've established a suitable API, it could be easier to natively support more database types.
> > > > >
> > > > > Potential concerns:
> > > > >
> > > > > * Sparse database support
> > > > > It's a lot of effort to write database connectors, especially if starting from scratch (unlike with, say, JDBC). What if we end up supporting only 1 or 2 database servers? Perhaps in that case we could keep the module without publishing it to crates.io until we're happy with database support, or even its usage.
> > > > >
> > > > > * Dependency bloat
> > > > > We could feature-gate database types to reduce the number of dependencies if one only wants certain DB connectors.
> > > > >
> > > > > * Why not use Java's JDBC adapter?
> > > > > I already do this, but sometimes, if working on a Rust project, creating a separate JVM service solely to extract Arrow data is a lot of effort. I also don't think it's currently possible to use the adapter to save Arrow data in a database.
> > > > >
> > > > > * What about Flight SQL extensions?
> > > > > There have been discussions around creating Flight SQL extensions, and the Rust SQL adapter could implement that and co-exist well.
> > > > > From a crate-dependency perspective, *arrow-flight* depends on *arrow*, so it could also depend on this *arrow-sql* crate.
> > > > >
> > > > > Please let me know what you think.
> > > > >
> > > > > Regards,
> > > > > Neville
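[Editor's note: the feature-gating idea raised under "Dependency bloat" above is usually expressed with Cargo features. A sketch of how a hypothetical *arrow-sql* crate's manifest might gate each backend; crate choices and version numbers are illustrative only. An optional dependency implicitly defines a feature of the same name, so `cargo build --features postgres` pulls in only that driver.]

```toml
# Hypothetical Cargo.toml fragment for an arrow-sql crate: every backend is
# off by default and only compiled when its feature is enabled.
[features]
default = []

[dependencies]
postgres = { version = "0.19", optional = true }  # illustrative version
rusqlite = { version = "0.29", optional = true }  # illustrative version
```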