Hi Neville,

Yes, my concern with common row-based DB APIs is that I use Arrow/Parquet for OLAP too. What https://turbodbc.readthedocs.io/en/latest/ (Python) or https://github.com/pacman82/odbc-api#state (Rust) do is read large blocks of data instead of processing rows one by one; but indeed, ODBC and the PostgreSQL wire protocol are still row-based.
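The block-based fetching described above can be sketched roughly like this. This is a hypothetical illustration, not odbc-api's real API: the row source and the `ColumnBlock` type are stand-ins, and the point is only that rows are accumulated into column buffers of `batch_size` so the row-to-column transpose happens once per block rather than per row.

```rust
// Stand-in for a block of column buffers (in practice: Arrow arrays).
struct ColumnBlock {
    ids: Vec<i64>,
    names: Vec<String>,
}

// Drain a row source into fixed-size column blocks.
fn fetch_blocks(
    rows: impl Iterator<Item = (i64, String)>,
    batch_size: usize,
) -> Vec<ColumnBlock> {
    let mut blocks = Vec::new();
    let mut current = ColumnBlock {
        ids: Vec::with_capacity(batch_size),
        names: Vec::with_capacity(batch_size),
    };
    for (id, name) in rows {
        current.ids.push(id);
        current.names.push(name);
        if current.ids.len() == batch_size {
            let full = std::mem::replace(
                &mut current,
                ColumnBlock {
                    ids: Vec::with_capacity(batch_size),
                    names: Vec::with_capacity(batch_size),
                },
            );
            blocks.push(full);
        }
    }
    if !current.ids.is_empty() {
        blocks.push(current); // final, possibly short, block
    }
    blocks
}

fn main() {
    let rows = (0i64..5).map(|i| (i, format!("row-{}", i)));
    let blocks = fetch_blocks(rows, 2);
    // 5 rows with batch_size 2 -> blocks of 2, 2 and 1 rows
    assert_eq!(blocks.len(), 3);
    assert_eq!(blocks[0].ids, vec![0, 1]);
    assert_eq!(blocks[2].names, vec!["row-4".to_string()]);
    println!("ok");
}
```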
ClickHouse is an interesting example, as it directly supports Arrow and Parquet *server-side* (I didn't try it yet, I just read it in the docs).

Best regards,
Adam Lippai

On Sun, Sep 27, 2020 at 11:24 PM Neville Dipale <nevilled...@gmail.com> wrote:

> Thanks for the feedback
>
> My interest is mainly in the narrow use case of reading and writing batch
> data, so I wouldn't want to deal with producing and consuming rows per se.
> Andy has worked on RDBC (https://github.com/tokio-rs/rdbc) for the
> row-based or OLTP case, and I'm considering something more suitable for
> the OLAP case.
>
> @Wes I'll have a read through the Python DB API; I've also been looking at
> JDBC, as well as how Apache Spark manages to get such good performance
> from JDBC.
>
> I haven't been an ODBC fan, but mainly because of historic struggles with
> getting it to work on Linux envs where I don't have system control. With
> that said, we could still support ODBC.
>
> @Jorge, I have an implementation at rust-dataframe
> (https://github.com/nevi-me/rust-dataframe/tree/master/src/io/sql/postgres)
> which uses rust-postgres. I however don't use the row-based API, as that
> comes at a serialization cost (going from bytes > Rust types > Arrow).
> I instead use the Postgres binary format
> (https://github.com/nevi-me/rust-dataframe/blob/master/src/io/sql/postgres/reader.rs#L204).
> That postgres module would be the starting point of such a separate crate.
>
> For Postgres <> Arrow type conversions, I leverage 2 methods:
>
> 1. When reading a table, I get the schema from the *information_schema*
> system table
> 2. When reading a query, I issue the query with a 1-row limit, and convert
> the row's schema to an Arrow schema
>
> @Adam I think async and pooling would be attainable, yes; if an underlying
> SQL crate uses r2d2 for pooling, an API that supports that could be
> provided.
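Neville's second schema-discovery method (issue the query with a 1-row limit and convert the result's types) can be sketched as below. This is an illustrative sketch only: `probe_query` and `pg_type_to_arrow` are hypothetical helpers, and the type mapping is a small subset of a real Postgres-to-Arrow conversion table.

```rust
// Wrap an arbitrary query in a 1-row LIMIT so the server returns the
// result's column types cheaply. Running the original as a subselect
// preserves any LIMIT/ORDER BY it already has.
fn probe_query(query: &str) -> String {
    format!("SELECT * FROM ({}) AS probe LIMIT 1", query)
}

// Map Postgres type names to Arrow type names (illustrative subset).
fn pg_type_to_arrow(pg_type: &str) -> Option<&'static str> {
    match pg_type {
        "int2" => Some("Int16"),
        "int4" => Some("Int32"),
        "int8" => Some("Int64"),
        "float4" => Some("Float32"),
        "float8" => Some("Float64"),
        "bool" => Some("Boolean"),
        "text" | "varchar" => Some("Utf8"),
        _ => None, // unsupported types would surface as errors upstream
    }
}

fn main() {
    assert_eq!(
        probe_query("SELECT id, name FROM users"),
        "SELECT * FROM (SELECT id, name FROM users) AS probe LIMIT 1"
    );
    assert_eq!(pg_type_to_arrow("int8"), Some("Int64"));
    assert_eq!(pg_type_to_arrow("uuid"), None);
    println!("ok");
}
```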
>
> In summary, I'm thinking along the lines of:
>
> * A reader that takes connection parameters & a query or table
> * The reader can handle partitioning if need be (similar to how Spark
>   does it)
> * The reader returns a Schema, and can be iterated on to return data in
>   batches
>
> * A writer that takes connection parameters and a table
> * The writer writes batches to a table, and is able to write batches in
>   parallel
>
> In the case of a hypothetical interfacing with column databases like
> ClickHouse, we would be able to leverage materialising Arrow arrays
> directly from columns, instead of the potential column-wise conversions
> that can be performed from row-based APIs.
>
> Neville
>
> On Sun, 27 Sep 2020 at 22:08, Adam Lippai <a...@rigo.sk> wrote:
>
> > One more universal approach is to use ODBC; this is a recent Rust
> > conversation (with example) on the topic:
> > https://github.com/Koka/odbc-rs/issues/140
> >
> > Honestly, I find the Python DB API too simple; all it provides is a
> > row-by-row API. I miss four things:
> >
> > - Batched or bulk processing, both for data loading and dumping.
> > - Async support (Python has asyncio and async web frameworks, but no
> >   async DB spec). SQLAlchemy async support is coming soon, and there is
> >   https://github.com/encode/databases
> > - Connection pooling (it's common to use TLS; connection reuse would
> >   be nice as TLS 1.3 is not here yet)
> > - Failover / load balancing support (this is connected to the previous)
> >
> > Best regards,
> > Adam Lippai
> >
> > On Sun, Sep 27, 2020 at 9:57 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > That would be awesome! I agree with this, and it would be really
> > > useful, as it would leverage all the goodies that RDBMSs have wrt
> > > transactions, etc.
> > >
> > > I would probably go for having database-specifics outside of the arrow
> > > project, so that they can be used by other folks beyond arrow, and
> > > keep the arrow-specifics (i.e.
> > > conversion from the format of the specific databases to arrow) as
> > > part of the arrow crate. Ideally, as Wes wrote, with some standard to
> > > make it easier to handle different DBs.
> > >
> > > I think that there are two layers: one is how to connect to a
> > > database, the other is how to serialize/deserialize. AFAIK PEP 249
> > > covers both layers, as it standardizes things like `connect` and
> > > `tpc_begin`, as well as how things should be serialized to Python
> > > objects (e.g. dates should be datetime.date). This split is done by
> > > postgres for Rust <https://github.com/sfackler/rust-postgres>, as it
> > > offers 5 crates:
> > > * postgres-async
> > > * postgres-sync (a blocking wrapper of postgres-async)
> > > * postgres-types (to convert to native Rust <---- IMO this one is
> > >   what we want to offer in Arrow)
> > > * postgres-TLS
> > > * postgres-openssl
> > >
> > > `postgres-sync` implements Iterator<Row> (`client.query`), and
> > > postgres-async implements Stream<Row>.
> > >
> > > One idea is to have a generic<T> iterator/stream adapter that yields
> > > RecordBatches. The implementation of this trait by different
> > > providers would give support to be used in Arrow and DataFusion.
> > >
> > > Besides postgres, one idea is to pick the top from this list
> > > <https://db-engines.com/en/ranking>:
> > >
> > > * Oracle
> > > * MySQL
> > > * MsSQL
> > >
> > > Another idea is to start by supporting SQLite, which is a good
> > > development env to work with relational databases.
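The generic adapter Jorge describes (anything yielding rows is wrapped to yield batches instead) can be sketched as follows. The `Row` items and the `Vec` standing in for `RecordBatch` are placeholders here, not the arrow crate's types; a real implementation would build Arrow arrays column by column inside `next`.

```rust
// Wrap any row iterator so it yields fixed-size chunks ("batches").
struct BatchAdapter<I: Iterator> {
    rows: I,
    batch_size: usize,
}

impl<I: Iterator> Iterator for BatchAdapter<I> {
    // A Vec of rows stands in for a RecordBatch in this sketch.
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        let batch: Vec<_> = self.rows.by_ref().take(self.batch_size).collect();
        if batch.is_empty() {
            None // source exhausted
        } else {
            Some(batch)
        }
    }
}

fn main() {
    // Pretend these tuples are rows coming off a DB cursor.
    let rows = (0..7).map(|i| (i, i * 10));
    let batches: Vec<_> = BatchAdapter { rows, batch_size: 3 }.collect();
    assert_eq!(batches.len(), 3); // 3 + 3 + 1 rows
    assert_eq!(batches[2], vec![(6, 60)]);
    println!("ok");
}
```

The same shape works for the async case by implementing `Stream` instead of `Iterator` over the underlying `Stream<Row>`.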
> > >
> > > Best,
> > > Jorge
> > >
> > > On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com>
> > > wrote:
> > >
> > > > Hi Arrow developers
> > > >
> > > > I would like to gauge the appetite for an Arrow SQL connector that:
> > > >
> > > > * Reads and writes Arrow data to and from SQL databases
> > > > * Reads tables and queries into record batches, and writes batches
> > > >   to tables (either append or overwrite)
> > > > * Leverages binary SQL formats where available (e.g. the PostgreSQL
> > > >   format is relatively easy and well-documented)
> > > > * Provides a batch interface that abstracts away the different
> > > >   database semantics, and exposes a RecordBatchReader
> > > >   (https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html),
> > > >   and perhaps a RecordBatchWriter
> > > > * Resides in the Rust repo as either an arrow::sql module (like
> > > >   arrow::csv, arrow::json, arrow::ipc) or alternatively as a
> > > >   separate crate in the workspace (*arrow-sql*?)
> > > >
> > > > I would be able to contribute a Postgres reader/writer as a start.
> > > > I could make this a separate crate, but to drive adoption I would
> > > > prefer this living in Arrow; also, it can remain updated (sometimes
> > > > we reorganise modules and end up breaking dependencies).
> > > >
> > > > Also, being developed next to DataFusion could allow DF to support
> > > > SQL databases, as this would be yet another data source.
> > > >
> > > > Some questions:
> > > > * Should such a library support async, sync or both IO methods?
> > > > * Other than Postgres, what other databases would be interesting?
> > > >   Here I'm hoping that once we've established a suitable API, it
> > > >   could be easier to natively support more database types.
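The connector API Neville proposes (a reader that exposes a schema up front and is iterated for batches, mirroring arrow's RecordBatchReader) might look roughly like the sketch below. The trait name, the `Schema` and `RecordBatch` stand-ins, and the `MockReader` are all hypothetical; a real version would use the arrow crate's types.

```rust
// Stand-in schema: (field name, type name) pairs.
#[derive(Debug, Clone)]
struct Schema {
    fields: Vec<(String, String)>,
}

// Stand-in batch: one Vec per column.
type RecordBatch = Vec<Vec<i64>>;

// Hypothetical connector trait, mirroring arrow's RecordBatchReader:
// the schema is known before any batch is read, and batches arrive
// through iteration, each fallibly.
trait SqlBatchReader: Iterator<Item = Result<RecordBatch, String>> {
    /// Schema of the batches this reader yields.
    fn schema(&self) -> Schema;
}

// A mock implementation over in-memory batches, for illustration.
struct MockReader {
    schema: Schema,
    batches: std::vec::IntoIter<RecordBatch>,
}

impl Iterator for MockReader {
    type Item = Result<RecordBatch, String>;
    fn next(&mut self) -> Option<Self::Item> {
        self.batches.next().map(Ok)
    }
}

impl SqlBatchReader for MockReader {
    fn schema(&self) -> Schema {
        self.schema.clone()
    }
}

fn main() {
    let reader = MockReader {
        schema: Schema { fields: vec![("id".to_string(), "Int64".to_string())] },
        batches: vec![vec![vec![1, 2, 3]], vec![vec![4, 5]]].into_iter(),
    };
    assert_eq!(reader.schema().fields[0].0, "id");
    // Total row count across all batches of the first column.
    let total: usize = reader.map(|b| b.unwrap()[0].len()).sum();
    assert_eq!(total, 5);
    println!("ok");
}
```

A writer would be the mirror image: constructed with connection parameters and a table name, accepting batches one at a time.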
> > > >
> > > > Potential concerns:
> > > >
> > > > * Sparse database support
> > > > It's a lot of effort to write database connectors, especially if
> > > > starting from scratch (unlike with, say, JDBC). What if we end up
> > > > supporting 1 or 2 database servers?
> > > > Perhaps in that case we could keep the module without publishing it
> > > > to crates.io until we're happy with database support, or even its
> > > > usage.
> > > >
> > > > * Dependency bloat
> > > > We could feature-gate database types to reduce the number of
> > > > dependencies if one only wants certain DB connectors.
> > > >
> > > > * Why not use Java's JDBC adapter?
> > > > I already do this, but sometimes, if working on a Rust project,
> > > > creating a separate JVM service solely to extract Arrow data is a
> > > > lot of effort.
> > > > I also don't think it's currently possible to use the adapter to
> > > > save Arrow data in a database.
> > > >
> > > > * What about Flight SQL extensions?
> > > > There have been discussions around creating Flight SQL extensions,
> > > > and the Rust SQL adapter could implement that and co-exist well.
> > > > From a crate dependency perspective, *arrow-flight* depends on
> > > > *arrow*, so it could also depend on this *arrow-sql* crate.
> > > >
> > > > Please let me know what you think
> > > >
> > > > Regards
> > > > Neville
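The feature-gating mentioned under "Dependency bloat" could look like this in a hypothetical *arrow-sql* Cargo.toml. The crate names, versions, and feature names below are illustrative, not a proposal; the point is that each connector's native client is an optional dependency pulled in only by its feature.

```toml
[dependencies]
# Each connector's client crate is optional.
postgres = { version = "0.17", optional = true }
rusqlite = { version = "0.24", optional = true }

[features]
default = []
# Users enable only the connectors they need, e.g.:
#   arrow-sql = { version = "...", features = ["postgres-connector"] }
postgres-connector = ["postgres"]
sqlite-connector = ["rusqlite"]
```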