Andrew Wong created KUDU-2077:
---------------------------------
Summary: Return data in Apache Arrow format
Key: KUDU-2077
URL: https://issues.apache.org/jira/browse/KUDU-2077
Project: Kudu
Issue Type: New Feature
Components: client, server
Reporter: Andrew Wong
Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow is
an in-memory columnar format designed to be the common data format for a large
number of projects (see [here](https://arrow.apache.org/)). One place we
thought adding this would be particularly fitting is when sending results back
to the client, since this currently returns row-wise data. By returning Arrow,
this could open the door to simpler and faster integration with other projects.
The server-side changes can be localized to the tablet service and wire
protocol. We considered using Arrow more exhaustively throughout the server
codebase, but found that because Arrow and Kudu's own in-memory format (i.e.
that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the
buffers from ColumnBlock to the scan response and build arrow::Arrays
client-side. A POC of the server-side changes can be found here:
https://github.com/danburkert/kudu/tree/arrow
At the time of writing this, the arrow::Array type has a varying number of
arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one for
data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be
compatible with these Buffers with a couple of modifications:
* The null-bitmaps in arrow are the complement of those used by Kudu
* The RowBlock that owns the ColumnBlocks has a selection vector needs to be
accounted for
If the buffers are transferred over the wire (via sidecars or protobuf), they
should be able to be converted to Arrays via arrow::ArrayData or directly via
the Array constructors.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)