Developing native Arrow interfaces to database protocols

Wes McKinney Tue, 21 Aug 2018 07:38:48 -0700

hi folks,

Since the project's inception I have wanted to develop
higher-performance database clients that natively return the Arrow
columnar format. This is a natural analogue to building Arrow-native
interfaces to storage formats like Parquet and ORC. If we can't get
fast access to data, many other parts of the project become less
useful.


Example databases I have actively used include:

- PostgreSQL (see ARROW-1106)
- HiveServer2: for Apache Hive and Apache Impala (see ARROW-3050)

There's good reason to build this software in Apache Arrow:

* Define reusable Arrow-oriented abstractions for putting and getting
result sets from databases
* Define reusable APIs in the bindings (Python, Ruby, R, etc.)
* Benefit from a common build toolchain so that packaging in e.g.
Python is much simpler
* Fewer release / packaging cycles to manage (I don't have the
capacity to manage any more release and packaging cycles than I am
already involved with)
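
As a rough illustration of the first point above, here is a minimal
Python sketch of a "get a result set as columns" abstraction. It uses
only the standard-library sqlite3 module and plain lists so that it is
self-contained. The names `fetch_columnar` and `Batch` are hypothetical
and not part of Arrow. A native client would decode the wire protocol
directly into Arrow record batches instead of transposing Python row
tuples as done here.

```python
import sqlite3
from typing import Any, Dict, Iterator, List

# Hypothetical stand-in for an Arrow record batch: column name -> values.
Batch = Dict[str, List[Any]]


def fetch_columnar(cursor, batch_size: int = 1024) -> Iterator[Batch]:
    """Yield chunks of a DB-API result set in column-oriented form."""
    # DB-API cursors expose column names as the first item of each
    # entry in cursor.description.
    names = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Transpose the list of row tuples into one list per column.
        columns = [list(col) for col in zip(*rows)]
        yield dict(zip(names, columns))


# Small usage example against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
cur = conn.execute("SELECT id, name FROM t ORDER BY id")
batches = list(fetch_columnar(cur, batch_size=2))
```

The same function signature could sit in front of any database driver,
which is the kind of reuse across bindings the list above is after.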

The only example of an Arrow-native DB client so far is the Turbodbc
project (https://github.com/blue-yonder/turbodbc). I actually think
that it would be beneficial to have native ODBC interop in Apache
Arrow (we recently gained JDBC), but it's fine with me if the
Turbodbc community wishes to remain a third-party project under its
own governance and release cycle over the long term.

While I was still at Cloudera I helped develop a small C++ and Python
library (Apache 2.0) for interacting with HiveServer2, but it has
become abandonware. I have taken the liberty of forking this code and
modifying it to build as an optional component of the Arrow C++
codebase:

https://github.com/apache/arrow/pull/2444

I would like to merge this PR and proceed with creating more database
interfaces within the project, and defining common abstractions to
help users access data faster and be more productive.

Thanks,
Wes
