Hi folks,

Since the project's inception, I have wanted to develop higher-performance database clients that natively return the Arrow columnar format. This is a natural analogue to building Arrow-native interfaces to storage formats like Parquet and ORC. If we can't get fast access to data, many other parts of the project become less useful.

Example databases I have actively used include:

- PostgreSQL (see ARROW-1106)
- HiveServer2: for Apache Hive and Apache Impala (see ARROW-3050)

There's good reason to build this software in Apache Arrow:

* Define reusable Arrow-oriented abstractions for putting and getting result sets from databases
* Define reusable APIs in the bindings (Python, Ruby, R, etc.)
* Benefit from a common build toolchain so that packaging in e.g. Python is much simpler
* Fewer release / packaging cycles to manage (I don't have the capacity to manage any more release and packaging cycles than I am already involved with)

The only example of an Arrow-native DB client so far is the Turbodbc project (https://github.com/blue-yonder/turbodbc). I actually think that it would be beneficial to have native ODBC interop in Apache Arrow (we recently gained JDBC), but it's fine with me if the Turbodbc community wishes to remain, long term, a third-party project under its own governance and release cycle.

While I was still at Cloudera I helped develop a small C++ and Python library (Apache 2.0) for interacting with HiveServer2, but it has become abandonware. I have taken the liberty of forking this code and modifying it to build as an optional component of the Arrow C++ codebase:

https://github.com/apache/arrow/pull/2444

I would like to merge this PR and proceed with creating more database interfaces within the project, and defining common abstractions to help users access data faster and be more productive.

Thanks,
Wes