I haven't had a chance yet to read this thread in detail, but I am definitely very interested in this conversation and will make time this week to add my thoughts.
Thanks, Andy.

On Sun, Sep 27, 2020, 4:01 PM Adam Lippai <a...@rigo.sk> wrote:

> Hi Neville,
>
> Yes, my concern with common row-based DB APIs is that I use Arrow/Parquet for OLAP too. What https://turbodbc.readthedocs.io/en/latest/ (Python) or https://github.com/pacman82/odbc-api#state (Rust) does is read large blocks of data instead of processing rows one by one, but indeed, ODBC and the PostgreSQL wire protocol are still row-based.
>
> Clickhouse is an interesting example, as it directly supports Arrow and Parquet *server-side* (I didn't try it yet, just read it in the docs).
>
> Best regards,
> Adam Lippai
>
> On Sun, Sep 27, 2020 at 11:24 PM Neville Dipale <nevilled...@gmail.com> wrote:
>
> > Thanks for the feedback.
> >
> > My interest is mainly in the narrow use case of reading and writing batch data, so I wouldn't want to deal with producing and consuming rows per se. Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the row-based or OLTP case, and I'm considering something more suitable for the OLAP case.
> >
> > @Wes I'll have a read through the Python DB API; I've also been looking at JDBC, as well as how Apache Spark manages to get such good performance from JDBC.
> >
> > I haven't been an ODBC fan, but mainly because of historic struggles with getting it to work on Linux envs where I don't have system control. With that said, we could still support ODBC.
> >
> > @Jorge, I have an implementation at rust-dataframe (https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres) which uses rust-postgres. I however don't use the row-based API, as that comes at a serialization cost (going from bytes -> Rust types -> Arrow). I instead use the Postgres binary format (https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204).
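[Editor's note: a minimal sketch of the binary-format approach Neville describes above. PostgreSQL's `COPY ... TO STDOUT (FORMAT binary)` stream starts with a fixed 11-byte signature, a 4-byte flags field, and a 4-byte header-extension length; tuples then carry a 16-bit field count and length-prefixed fields. The function names and the idea of hand-parsing the header here are illustrative, not taken from the rust-dataframe code; a real reader would stream these bytes from rust-postgres and build Arrow arrays column by column.]

```rust
/// The 11-byte signature that starts every PostgreSQL binary COPY stream.
const SIGNATURE: &[u8; 11] = b"PGCOPY\n\xff\r\n\0";

/// Validates the COPY header and returns the offset where tuple data begins.
fn parse_copy_header(buf: &[u8]) -> Result<usize, String> {
    if buf.len() < 19 || &buf[..11] != &SIGNATURE[..] {
        return Err("not a binary COPY stream".to_string());
    }
    // 4-byte flags field, then 4-byte header-extension length (big-endian).
    let ext_len = u32::from_be_bytes([buf[15], buf[16], buf[17], buf[18]]) as usize;
    Ok(19 + ext_len)
}

/// Reads the 16-bit field count of the tuple starting at `offset`
/// (-1, i.e. 0xFFFF, marks the end-of-data trailer).
fn tuple_field_count(buf: &[u8], offset: usize) -> i16 {
    i16::from_be_bytes([buf[offset], buf[offset + 1]])
}

fn main() {
    // A hand-built stream: header, one tuple with a single 4-byte field, trailer.
    let mut stream = Vec::new();
    stream.extend_from_slice(SIGNATURE);
    stream.extend_from_slice(&0u32.to_be_bytes()); // flags
    stream.extend_from_slice(&0u32.to_be_bytes()); // no header extension
    stream.extend_from_slice(&1i16.to_be_bytes()); // 1 field in this tuple
    stream.extend_from_slice(&4i32.to_be_bytes()); // field length
    stream.extend_from_slice(&42i32.to_be_bytes()); // field value
    stream.extend_from_slice(&(-1i16).to_be_bytes()); // trailer

    let data_start = parse_copy_header(&stream).unwrap();
    assert_eq!(data_start, 19);
    assert_eq!(tuple_field_count(&stream, data_start), 1);
    println!("tuple data starts at byte {}", data_start);
}
```

The point of going through this format rather than a driver's `Row` API is that field bytes can be copied straight into Arrow buffers without an intermediate Rust value per cell.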
> > That postgres module would be the starting point of such a separate crate.
> >
> > For Postgres <> Arrow type conversions, I leverage two methods:
> >
> > 1. When reading a table, I get the schema from the *information_schema* system table.
> > 2. When reading a query, I issue the query with a 1-row limit, and convert the row's schema to an Arrow schema.
> >
> > @Adam I think async and pooling would be attainable, yes. If an underlying SQL crate uses r2d2 for pooling, an API that supports that could be provided.
> >
> > In summary, I'm thinking along the lines of:
> >
> > * A reader that takes connection parameters & a query or table
> > * The reader can handle partitioning if need be (similar to how Spark does it)
> > * The reader returns a Schema, and can be iterated on to return data in batches
> >
> > * A writer that takes connection parameters and a table
> > * The writer writes batches to a table, and is able to write batches in parallel
> >
> > In the case of a hypothetical interfacing with columnar databases like Clickhouse, we would be able to leverage materialising Arrow arrays from columns, instead of the potential column-wise conversions that can be performed from row-based APIs.
> >
> > Neville
> >
> > On Sun, 27 Sep 2020 at 22:08, Adam Lippai <a...@rigo.sk> wrote:
> >
> > > One more universal approach is to use ODBC; this is a recent Rust conversation (with example) on the topic: https://github.com/Koka/odbc-rs/issues/140
> > >
> > > Honestly I find the Python DB API too simple; all it provides is a row-by-row API. I miss four things:
> > >
> > > - Batched or bulk processing, both for data loading and dumping.
> > > - Async support (Python has asyncio and async web frameworks, but no async DB spec).
> > >   SQLAlchemy async support is coming soon, and there is https://github.com/encode/databases
> > > - Connection pooling (it's common to use TLS; connection reuse would be nice as TLS 1.3 is not here yet).
> > > - Failover / load balancing support (this is connected to the previous).
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >
> > > > That would be awesome! I agree with this, and it would be really useful, as it would leverage all the goodies that RDBMSs have wrt transactions, etc.
> > > >
> > > > I would probably go for having database-specifics outside of the Arrow project, so that they can be used by other folks beyond Arrow, and keep the Arrow-specifics (i.e. conversion from the format of the specific databases to Arrow) as part of the arrow crate. Ideally, as Wes wrote, with some standard to make it easier to handle different DBs.
> > > >
> > > > I think that there are two layers: one is how to connect to a database, the other is how to serialize/deserialize. AFAIK PEP 249 covers both layers, as it standardizes things like `connect` and `tpc_begin`, as well as how things should be serialized to Python objects (e.g. dates should be datetime.date). This split is done by postgres for Rust <https://github.com/sfackler/rust-postgres>, as it offers 5 crates:
> > > > * postgres-async
> > > > * postgres-sync (a blocking wrapper of postgres-async)
> > > > * postgres-types (to convert to native Rust <---- IMO this one is what we want to offer in Arrow)
> > > > * postgres-TLS
> > > > * postgres-openssl
> > > >
> > > > `postgres-sync` implements Iterator<Row> (`client.query`), and postgres-async implements Stream<Row>.
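[Editor's note: a sketch of the serialization layer Jorge isolates above. The trait and impl below are hypothetical, not the actual postgres-types API; they only illustrate the shape of a "wire bytes to native value" layer. Postgres sends an int4 as 4 big-endian bytes, so a minimal version looks like this; an Arrow-focused crate would offer the same conversion straight into Arrow arrays instead of scalars.]

```rust
/// Hypothetical conversion trait: turn a column's raw wire bytes into a
/// native value (the role postgres-types plays for rust-postgres).
trait FromSqlBytes: Sized {
    fn from_sql_bytes(raw: &[u8]) -> Result<Self, String>;
}

impl FromSqlBytes for i32 {
    // Postgres int4: exactly 4 bytes, network (big-endian) byte order.
    fn from_sql_bytes(raw: &[u8]) -> Result<Self, String> {
        let bytes: [u8; 4] = raw
            .try_into()
            .map_err(|_| "expected 4 bytes for int4".to_string())?;
        Ok(i32::from_be_bytes(bytes))
    }
}

fn main() {
    assert_eq!(i32::from_sql_bytes(&[0, 0, 0, 42]), Ok(42));
    assert!(i32::from_sql_bytes(&[0, 0]).is_err());
}
```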
> > > > One idea is to have a generic<T> iterator/stream adapter that yields RecordBatches. The implementation of this trait by different providers would give support for it to be used in Arrow and DataFusion.
> > > >
> > > > Besides postgres, one idea is to pick the top from this list <https://db-engines.com/en/ranking>:
> > > >
> > > > * Oracle
> > > > * MySQL
> > > > * MsSQL
> > > >
> > > > Another idea is to start by supporting SQLite, which is a good development env to work with relational databases.
> > > >
> > > > Best,
> > > > Jorge
> > > >
> > > > On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com> wrote:
> > > >
> > > > > Hi Arrow developers,
> > > > >
> > > > > I would like to gauge the appetite for an Arrow SQL connector that:
> > > > >
> > > > > * Reads and writes Arrow data to and from SQL databases
> > > > > * Reads tables and queries into record batches, and writes batches to tables (either append or overwrite)
> > > > > * Leverages binary SQL formats where available (e.g. the PostgreSQL format is relatively easy and well-documented)
> > > > > * Provides a batch interface that abstracts away the different database semantics, and exposes a RecordBatchReader (https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html), and perhaps a RecordBatchWriter
> > > > > * Resides in the Rust repo as either an arrow::sql module (like arrow::csv, arrow::json, arrow::ipc) or alternatively is a separate crate in the workspace (*arrow-sql*?)
> > > > >
> > > > > I would be able to contribute a Postgres reader/writer as a start.
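[Editor's note: a minimal sketch of the generic adapter Jorge proposes, batching any driver's `Iterator<Row>` into record batches. `Row` and `RecordBatch` here are hypothetical stand-ins for the arrow crate's types, used so the example is self-contained; a real implementation would build typed Arrow arrays per column and also expose the schema up front, as in Neville's reader outline.]

```rust
/// Stand-in for a driver row: one i64 value per column.
type Row = Vec<i64>;

/// Stand-in for arrow::record_batch::RecordBatch.
#[derive(Debug, PartialEq)]
struct RecordBatch {
    rows: Vec<Row>,
}

/// Adapts any row iterator into an iterator of batches of up to `batch_size`
/// rows, so row-based drivers can feed batch-oriented consumers.
struct BatchReader<I: Iterator<Item = Row>> {
    rows: I,
    batch_size: usize,
}

impl<I: Iterator<Item = Row>> Iterator for BatchReader<I> {
    type Item = RecordBatch;

    fn next(&mut self) -> Option<RecordBatch> {
        let rows: Vec<Row> = self.rows.by_ref().take(self.batch_size).collect();
        if rows.is_empty() {
            None
        } else {
            Some(RecordBatch { rows })
        }
    }
}

fn main() {
    // Five 1-column rows in batches of two -> batches of 2, 2 and 1 rows.
    let rows = (0..5i64).map(|i| vec![i]);
    let reader = BatchReader { rows, batch_size: 2 };
    let sizes: Vec<usize> = reader.map(|b| b.rows.len()).collect();
    assert_eq!(sizes, vec![2, 2, 1]);
    println!("{:?}", sizes);
}
```

An async variant would implement `Stream<Item = RecordBatch>` over a driver's `Stream<Row>` in the same way.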
> > > > > I could make this a separate crate, but to drive adoption I would prefer this living in Arrow; it can also remain updated (sometimes we reorganise modules and end up breaking dependencies).
> > > > >
> > > > > Also, being developed next to DataFusion could allow DF to support SQL databases, as this would be yet another datasource.
> > > > >
> > > > > Some questions:
> > > > > * Should such a library support async, sync or both IO methods?
> > > > > * Other than Postgres, what other databases would be interesting? Here I'm hoping that once we've established a suitable API, it could be easier to natively support more database types.
> > > > >
> > > > > Potential concerns:
> > > > >
> > > > > * Sparse database support
> > > > > It's a lot of effort to write database connectors, especially if starting from scratch (unlike with, say, JDBC). What if we end up supporting only 1 or 2 database servers? Perhaps in that case we could keep the module without publishing it to crates.io until we're happy with database support, or even its usage.
> > > > >
> > > > > * Dependency bloat
> > > > > We could feature-gate database types to reduce the number of dependencies if one only wants certain DB connectors.
> > > > >
> > > > > * Why not use Java's JDBC adapter?
> > > > > I already do this, but sometimes, if working on a Rust project, creating a separate JVM service solely to extract Arrow data is a lot of effort. I also don't think it's currently possible to use the adapter to save Arrow data in a database.
> > > > >
> > > > > * What about Flight SQL extensions?
> > > > > There have been discussions around creating Flight SQL extensions, and the Rust SQL adapter could implement that and co-exist well.
> > > > > From a crate-dependency perspective, *arrow-flight* depends on *arrow*, so it could also depend on this *arrow-sql* crate.
> > > > >
> > > > > Please let me know what you think.
> > > > >
> > > > > Regards,
> > > > > Neville
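[Editor's note: the feature-gating idea raised under "Dependency bloat" above is usually expressed with Cargo features. A sketch of how a hypothetical *arrow-sql* crate's manifest might gate each backend; crate choices and version numbers are illustrative only. An optional dependency implicitly defines a feature of the same name, so `cargo build --features postgres` pulls in only that driver.]

```toml
# Hypothetical Cargo.toml fragment for an arrow-sql crate: every backend is
# off by default and only compiled when its feature is enabled.
[features]
default = []

[dependencies]
postgres = { version = "0.19", optional = true }  # illustrative version
rusqlite = { version = "0.29", optional = true }  # illustrative version
```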