Hi Neville,

Yes, my concern with common row-based DB APIs is that I use Arrow/Parquet for OLAP too. What https://turbodbc.readthedocs.io/en/latest/ (Python) or https://github.com/pacman82/odbc-api#state (Rust) do is read large blocks of data instead of processing rows one by one; but indeed, ODBC and the PostgreSQL wire protocol are still row-based.
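The block-based fetching described above can be sketched roughly like this. This is a hypothetical illustration, not odbc-api's real API: the row source and the `ColumnBlock` type are stand-ins, and the point is only that rows are accumulated into column buffers of `batch_size` so the row-to-column transpose happens once per block rather than per row.

```rust
// Stand-in for a block of column buffers (in practice: Arrow arrays).
struct ColumnBlock {
    ids: Vec<i64>,
    names: Vec<String>,
}

// Drain a row source into fixed-size column blocks.
fn fetch_blocks(
    rows: impl Iterator<Item = (i64, String)>,
    batch_size: usize,
) -> Vec<ColumnBlock> {
    let mut blocks = Vec::new();
    let mut current = ColumnBlock {
        ids: Vec::with_capacity(batch_size),
        names: Vec::with_capacity(batch_size),
    };
    for (id, name) in rows {
        current.ids.push(id);
        current.names.push(name);
        if current.ids.len() == batch_size {
            let full = std::mem::replace(
                &mut current,
                ColumnBlock {
                    ids: Vec::with_capacity(batch_size),
                    names: Vec::with_capacity(batch_size),
                },
            );
            blocks.push(full);
        }
    }
    if !current.ids.is_empty() {
        blocks.push(current); // final, possibly short, block
    }
    blocks
}

fn main() {
    let rows = (0i64..5).map(|i| (i, format!("row-{}", i)));
    let blocks = fetch_blocks(rows, 2);
    // 5 rows with batch_size 2 -> blocks of 2, 2 and 1 rows
    assert_eq!(blocks.len(), 3);
    assert_eq!(blocks[0].ids, vec![0, 1]);
    assert_eq!(blocks[2].names, vec!["row-4".to_string()]);
    println!("ok");
}
```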
ClickHouse is an interesting example, as it directly supports Arrow and Parquet *server-side* (I didn't try it yet, I just read it in the docs).

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 11:24 PM Neville Dipale <nevilled...@gmail.com> wrote:

> Thanks for the feedback
>
> My interest is mainly in the narrow use case of reading and writing batch
> data, so I wouldn't want to deal with producing and consuming rows per se.
> Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the
> row-based or OLTP case, and I'm considering something more suitable for
> the OLAP case.
>
> @Wes I'll have a read through the Python DB API; I've also been looking at
> JDBC, as well as how Apache Spark manages to get such good performance
> from JDBC.
>
> I haven't been an ODBC fan, but mainly because of historic struggles with
> getting it to work on Linux envs where I don't have system control. With
> that said, we could still support ODBC.
>
> @Jorge, I have an implementation at rust-dataframe
> (https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres)
> which uses rust-postgres. I however don't use the row-based API, as that
> comes at a serialization cost (going from bytes > Rust types > Arrow).
> I instead use the Postgres binary format
> (https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204).
> That postgres module would be the starting point of such a separate crate.
>
> For Postgres <> Arrow type conversions, I leverage 2 methods:
>
> 1. When reading a table, I get the schema from the *information_schema*
> system table
> 2. When reading a query, I issue the query with a 1-row limit, and convert
> the row's schema to an Arrow schema
>
> @Adam I think async and pooling would be attainable, yes; if an underlying
> SQL crate uses r2d2 for pooling, an API that supports that could be
> provided.
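Neville's second schema-discovery method (issue the query with a 1-row limit and convert the result's types) can be sketched as below. This is an illustrative sketch only: `probe_query` and `pg_type_to_arrow` are hypothetical helpers, and the type mapping is a small subset of a real Postgres-to-Arrow conversion table.

```rust
// Wrap an arbitrary query in a 1-row LIMIT so the server returns the
// result's column types cheaply. Running the original as a subselect
// preserves any LIMIT/ORDER BY it already has.
fn probe_query(query: &str) -> String {
    format!("SELECT * FROM ({}) AS probe LIMIT 1", query)
}

// Map Postgres type names to Arrow type names (illustrative subset).
fn pg_type_to_arrow(pg_type: &str) -> Option<&'static str> {
    match pg_type {
        "int2" => Some("Int16"),
        "int4" => Some("Int32"),
        "int8" => Some("Int64"),
        "float4" => Some("Float32"),
        "float8" => Some("Float64"),
        "bool" => Some("Boolean"),
        "text" | "varchar" => Some("Utf8"),
        _ => None, // unsupported types would surface as errors upstream
    }
}

fn main() {
    assert_eq!(
        probe_query("SELECT id, name FROM users"),
        "SELECT * FROM (SELECT id, name FROM users) AS probe LIMIT 1"
    );
    assert_eq!(pg_type_to_arrow("int8"), Some("Int64"));
    assert_eq!(pg_type_to_arrow("uuid"), None);
    println!("ok");
}
```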
>
> In summary, I'm thinking along the lines of:
>
> * A reader that takes connection parameters & a query or table
> * The reader can handle partitioning if need be (similar to how Spark
>   does it)
> * The reader returns a Schema, and can be iterated on to return data in
>   batches
>
> * A writer that takes connection parameters and a table
> * The writer writes batches to a table, and is able to write batches in
>   parallel
>
> In the case of a hypothetical interfacing with column databases like
> ClickHouse, we would be able to leverage materialising Arrow arrays
> directly from columns, instead of the potential column-wise conversions
> that can be performed from row-based APIs.
>
> Neville
>
> On Sun, 27 Sep 2020 at 22:08, Adam Lippai <a...@rigo.sk> wrote:
>
> > One more universal approach is to use ODBC; this is a recent Rust
> > conversation (with example) on the topic:
> > https://github.com/Koka/odbc-rs/issues/140
> >
> > Honestly, I find the Python DB API too simple; all it provides is a
> > row-by-row API. I miss four things:
> >
> > - Batched or bulk processing, both for data loading and dumping.
> > - Async support (Python has asyncio and async web frameworks, but no
> >   async DB spec). SQLAlchemy async support is coming soon, and there is
> >   https://github.com/encode/databases
> > - Connection pooling (it's common to use TLS; connection reuse would
> >   be nice as TLS 1.3 is not here yet)
> > - Failover / load balancing support (this is connected to the previous)
> >
> > Best regards,
> > Adam Lippai
> >
> > On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > That would be awesome! I agree with this, and it would be really
> > > useful, as it would leverage all the goodies that RDBMSs have wrt
> > > transactions, etc.
> > >
> > > I would probably go for having database-specifics outside of the arrow
> > > project, so that they can be used by other folks beyond arrow, and
> > > keep the arrow-specifics (i.e.
> > > conversion from the format of the specific databases to arrow) as
> > > part of the arrow crate. Ideally, as Wes wrote, with some standard to
> > > make it easier to handle different DBs.
> > >
> > > I think that there are two layers: one is how to connect to a
> > > database, the other is how to serialize/deserialize. AFAIK PEP 249
> > > covers both layers, as it standardizes things like `connect` and
> > > `tpc_begin`, as well as how things should be serialized to Python
> > > objects (e.g. dates should be datetime.date). This split is done by
> > > postgres for Rust <https://github.com/sfackler/rust-postgres>, as it
> > > offers 5 crates:
> > > * postgres-async
> > > * postgres-sync (a blocking wrapper of postgres-async)
> > > * postgres-types (to convert to native Rust <---- IMO this one is
> > >   what we want to offer in Arrow)
> > > * postgres-TLS
> > > * postgres-openssl
> > >
> > > `postgres-sync` implements Iterator<Row> (`client.query`), and
> > > postgres-async implements Stream<Row>.
> > >
> > > One idea is to have a generic<T> iterator/stream adapter that yields
> > > RecordBatches. The implementation of this trait by different
> > > providers would give support to be used in Arrow and DataFusion.
> > >
> > > Besides postgres, one idea is to pick the top from this list
> > > <https://db-engines.com/en/ranking>:
> > >
> > > * Oracle
> > > * MySQL
> > > * MsSQL
> > >
> > > Another idea is to start by supporting SQLite, which is a good
> > > development env to work with relational databases.
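The generic adapter Jorge describes (anything yielding rows is wrapped to yield batches instead) can be sketched as follows. The `Row` items and the `Vec` standing in for `RecordBatch` are placeholders here, not the arrow crate's types; a real implementation would build Arrow arrays column by column inside `next`.

```rust
// Wrap any row iterator so it yields fixed-size chunks ("batches").
struct BatchAdapter<I: Iterator> {
    rows: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchAdapter<I> {
    // A Vec of rows stands in for a RecordBatch in this sketch.
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        let batch: Vec<_> = self.rows.by_ref().take(self.batch_size).collect();
        if batch.is_empty() {
            None // source exhausted
        } else {
            Some(batch)
        }
    }
}

fn main() {
    // Pretend these tuples are rows coming off a DB cursor.
    let rows = (0..7).map(|i| (i, i * 10));
    let batches: Vec<_> = BatchAdapter { rows, batch_size: 3 }.collect();
    assert_eq!(batches.len(), 3); // 3 + 3 + 1 rows
    assert_eq!(batches[2], vec![(6, 60)]);
    println!("ok");
}
```

The same shape works for the async case by implementing `Stream` instead of `Iterator` over the underlying `Stream<Row>`.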
> > >
> > > Best,
> > > Jorge
> > >
> > > On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com>
> > > wrote:
> > >
> > > > Hi Arrow developers
> > > >
> > > > I would like to gauge the appetite for an Arrow SQL connector that:
> > > >
> > > > * Reads and writes Arrow data to and from SQL databases
> > > > * Reads tables and queries into record batches, and writes batches
> > > >   to tables (either append or overwrite)
> > > > * Leverages binary SQL formats where available (e.g. the PostgreSQL
> > > >   format is relatively easy and well-documented)
> > > > * Provides a batch interface that abstracts away the different
> > > >   database semantics, and exposes a RecordBatchReader
> > > >   (https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html),
> > > >   and perhaps a RecordBatchWriter
> > > > * Resides in the Rust repo as either an arrow::sql module (like
> > > >   arrow::csv, arrow::json, arrow::ipc) or alternatively as a
> > > >   separate crate in the workspace (*arrow-sql*?)
> > > >
> > > > I would be able to contribute a Postgres reader/writer as a start.
> > > > I could make this a separate crate, but to drive adoption I would
> > > > prefer this living in Arrow; also, it can remain updated (sometimes
> > > > we reorganise modules and end up breaking dependencies).
> > > >
> > > > Also, being developed next to DataFusion could allow DF to support
> > > > SQL databases, as this would be yet another data source.
> > > >
> > > > Some questions:
> > > > * Should such a library support async, sync or both IO methods?
> > > > * Other than Postgres, what other databases would be interesting?
> > > >   Here I'm hoping that once we've established a suitable API, it
> > > >   could be easier to natively support more database types.
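The connector API Neville proposes (a reader that exposes a schema up front and is iterated for batches, mirroring arrow's RecordBatchReader) might look roughly like the sketch below. The trait name, the `Schema` and `RecordBatch` stand-ins, and the `MockReader` are all hypothetical; a real version would use the arrow crate's types.

```rust
// Stand-in schema: (field name, type name) pairs.
#[derive(Debug, Clone)]
struct Schema {
    fields: Vec<(String, String)>,
}

// Stand-in batch: one Vec per column.
type RecordBatch = Vec<Vec<i64>>;

// Hypothetical connector trait, mirroring arrow's RecordBatchReader:
// the schema is known before any batch is read, and batches arrive
// through iteration, each fallibly.
trait SqlBatchReader: Iterator<Item = Result<RecordBatch, String>> {
    /// Schema of the batches this reader yields.
    fn schema(&self) -> Schema;
}

// A mock implementation over in-memory batches, for illustration.
struct MockReader {
    schema: Schema,
    batches: std::vec::IntoIter<RecordBatch>,
}

impl Iterator for MockReader {
    type Item = Result<RecordBatch, String>;
    fn next(&mut self) -> Option<Self::Item> {
        self.batches.next().map(Ok)
    }
}

impl SqlBatchReader for MockReader {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }
}

fn main() {
    let reader = MockReader {
        schema: Schema { fields: vec![("id".to_string(), "Int64".to_string())] },
        batches: vec![vec![vec![1, 2, 3]], vec![vec![4, 5]]].into_iter(),
    };
    assert_eq!(reader.schema().fields[0].0, "id");
    // Total row count across all batches of the first column.
    let total: usize = reader.map(|b| b.unwrap()[0].len()).sum();
    assert_eq!(total, 5);
    println!("ok");
}
```

A writer would be the mirror image: constructed with connection parameters and a table name, accepting batches one at a time.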
> > > >
> > > > Potential concerns:
> > > >
> > > > * Sparse database support
> > > > It's a lot of effort to write database connectors, especially if
> > > > starting from scratch (unlike with, say, JDBC). What if we end up
> > > > supporting 1 or 2 database servers?
> > > > Perhaps in that case we could keep the module without publishing it
> > > > to crates.io until we're happy with database support, or even its
> > > > usage.
> > > >
> > > > * Dependency bloat
> > > > We could feature-gate database types to reduce the number of
> > > > dependencies if one only wants certain DB connectors.
> > > >
> > > > * Why not use Java's JDBC adapter?
> > > > I already do this, but sometimes, if working on a Rust project,
> > > > creating a separate JVM service solely to extract Arrow data is a
> > > > lot of effort.
> > > > I also don't think it's currently possible to use the adapter to
> > > > save Arrow data in a database.
> > > >
> > > > * What about Flight SQL extensions?
> > > > There have been discussions around creating Flight SQL extensions,
> > > > and the Rust SQL adapter could implement that and co-exist well.
> > > > From a crate dependency perspective, *arrow-flight* depends on
> > > > *arrow*, so it could also depend on this *arrow-sql* crate.
> > > >
> > > > Please let me know what you think
> > > >
> > > > Regards
> > > > Neville
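The feature-gating mentioned under "Dependency bloat" could look like this in a hypothetical *arrow-sql* Cargo.toml. The crate names, versions, and feature names below are illustrative, not a proposal; the point is that each connector's native client is an optional dependency pulled in only by its feature.

```toml
[dependencies]
# Each connector's client crate is optional.
postgres = { version = "0.17", optional = true }
rusqlite = { version = "0.24", optional = true }

[features]
default = []
# Users enable only the connectors they need, e.g.:
#   arrow-sql = { version = "...", features = ["postgres-connector"] }
postgres-connector = ["postgres"]
sqlite-connector = ["rusqlite"]
```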