That would be awesome! I agree with this; it would be really useful, as it
would let us leverage all the goodies that RDBMSs offer with respect to
transactions, etc.

I would probably go for keeping the database-specific parts outside of the
Arrow project, so that they can be used by other folks beyond Arrow, and
keeping the Arrow-specific parts (i.e. conversion from each database's
format to Arrow) inside the arrow crate. Ideally, as Wes wrote, with some
standard to make it easier to handle different DBs.

I think there are two layers: one is how to connect to a database, the
other is how to serialize/deserialize the data. AFAIK PEP 249 covers both
layers, as it standardizes things like `connect` and `tpc_begin`, as well
as how values should be deserialized into Python objects (e.g. dates become
datetime.date). This split already exists in postgres for Rust
<https://github.com/sfackler/rust-postgres>, which offers 5 crates:
* postgres-async
* postgres-sync (a blocking wrapper of postgres-async)
* postgres-types (to convert to native Rust types <---- IMO this one is what
we want to offer in Arrow)
* postgres-tls
* postgres-openssl

`postgres-sync` implements `Iterator<Row>` (`client.query`), and
`postgres-async` implements `Stream<Row>`.
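
To make that concrete, here is a minimal sketch of reading rows with the
blocking client (the connection string, query and column types are made-up
examples, not something from this thread):

```rust
// Minimal sketch with the blocking `postgres` client; the connection string,
// query and column types are illustrative only.
use postgres::{Client, Error, NoTls};

fn read_rows() -> Result<(), Error> {
    let mut client = Client::connect("host=localhost user=postgres", NoTls)?;
    // `query` returns Vec<Row>; `query_raw` gives a fallible row iterator instead.
    for row in client.query("SELECT id, name FROM users", &[])? {
        // This is where postgres-types converts the wire format into native Rust.
        // An Arrow connector would do the same conversion, but into Arrow arrays.
        let id: i32 = row.get("id");
        let name: String = row.get("name");
        println!("{} {}", id, name);
    }
    Ok(())
}
```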

One idea is to have a generic iterator/stream adapter that yields
RecordBatches. Implementing that adapter for different providers would make
them usable from both Arrow and DataFusion; a rough sketch is below.
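
All names here (RowSource, BatchIter) are hypothetical, just to illustrate
the shape of the adapter:

```rust
// Hypothetical sketch of the adapter: a trait for "things that can produce
// Arrow data in chunks", plus a generic iterator over RecordBatches.
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// A provider-specific source: a Postgres cursor, a SQLite statement, etc.
trait RowSource {
    fn schema(&self) -> SchemaRef;
    /// Fetch up to `batch_size` rows, already converted into Arrow columns.
    fn next_batch(&mut self, batch_size: usize) -> Result<Option<RecordBatch>, ArrowError>;
}

/// Generic adapter: any RowSource becomes an
/// Iterator<Item = Result<RecordBatch, ArrowError>>, which is what
/// RecordBatchReader / DataFusion scans can consume. An async version
/// would implement Stream instead.
struct BatchIter<S: RowSource> {
    source: S,
    batch_size: usize,
}

impl<S: RowSource> Iterator for BatchIter<S> {
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.source.next_batch(self.batch_size).transpose()
    }
}
```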

Besides postgres, one idea is to pick the top entries from this list
<https://db-engines.com/en/ranking>:

* Oracle
* MySQL
* MsSQL

Another idea is to start by supporting SQLite, which is a good development
environment for working with relational databases.
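
For example, a quick prototype could look like this (using the rusqlite
crate, which is my own choice here, and a made-up table):

```rust
// Sketch of a SQLite-backed prototype; rusqlite and the table are purely
// illustrative assumptions.
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use rusqlite::Connection;

fn sqlite_to_batch() -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let conn = Connection::open_in_memory()?;
    conn.execute_batch(
        "CREATE TABLE t (id INTEGER, name TEXT);
         INSERT INTO t VALUES (1, 'a'), (2, 'b');",
    )?;

    // Pull the rows into native Rust buffers...
    let mut stmt = conn.prepare("SELECT id, name FROM t")?;
    let mut ids: Vec<i64> = Vec::new();
    let mut names: Vec<String> = Vec::new();
    let mut rows = stmt.query([])?;
    while let Some(row) = rows.next()? {
        ids.push(row.get(0)?);
        names.push(row.get(1)?);
    }

    // ...and build a RecordBatch out of them.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(ids)),
        Arc::new(StringArray::from(
            names.iter().map(String::as_str).collect::<Vec<&str>>(),
        )),
    ];
    Ok(RecordBatch::try_new(schema, columns)?)
}
```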

Best,
Jorge

On Sun, Sep 27, 2020 at 4:22 AM Neville Dipale <nevilled...@gmail.com>
wrote:

> Hi Arrow developers
>
> I would like to gauge the appetite for an Arrow SQL connector that:
>
> * Reads and writes Arrow data to and from SQL databases
> * Reads tables and queries into record batches, and writes batches to
> tables (either append or overwrite)
> * Leverages binary SQL formats where available (e.g. PostgreSQL format is
> relatively easy and well-documented)
> * Provides a batch interface that abstracts away the different database
> semantics, and exposes a RecordBatchReader (
> https://docs.rs/arrow/1.0.1/arrow/record_batch/trait.RecordBatchReader.html
> ),
> and perhaps a RecordBatchWriter
> * Resides in the Rust repo as either an arrow::sql module (like arrow::csv,
> arrow::json, arrow::ipc) or alternatively is a separate crate in the
> workspace  (*arrow-sql*?)
>
> I would be able to contribute a Postgres reader/writer as a start.
> I could make this a separate crate, but to drive adoption I would prefer
> this living in Arrow; it could also remain up to date (sometimes we
> reorganise modules and end up breaking dependencies).
>
> Also, being developed next to DataFusion could allow DF to support SQL
> databases, as this would be yet another datasource.
>
> Some questions:
> * Should such library support async, sync or both IO methods?
> * Other than postgres, what other databases would be interesting? Here I'm
> hoping that once we've established a suitable API, it could be easier to
> natively support more database types.
>
> Potential concerns:
>
> * Sparse database support
> It's a lot of effort to write database connectors, especially if starting
> from scratch (unlike with say JDBC). What if we end up supporting 1 or 2
> database servers?
> Perhaps in that case we could keep the module without publishing it to
> crates.io until we're happy with database support, or even its usage.
>
> * Dependency bloat
> We could feature-gate database types to reduce the number of dependencies
> if one only wants certain DB connectors
>
> * Why not use Java's JDBC adapter?
> I already do this, but sometimes if working on a Rust project, creating a
> separate JVM service solely to extract Arrow data is a lot of effort.
> I also don't think it's currently possible to use the adapter to save Arrow
> data in a database.
>
> * What about Flight SQL extensions?
> There have been discussions around creating Flight SQL extensions, and the
> Rust SQL adapter could implement that and co-exist well.
> From a crate dependency, *arrow-flight* depends on *arrow*, so it could
> also depend on this *arrow-sql* crate.
>
> Please let me know what you think
>
> Regards
> Neville
>
