[ https://issues.apache.org/jira/browse/KUDU-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wong updated KUDU-2077:
------------------------------
    Description: 
Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow is 
an in-memory columnar format designed to be the common data format for a large 
number of projects (see [here|https://arrow.apache.org]). One place we thought 
adding this would be particularly fitting is when sending results back to the 
client, since the server currently returns row-wise data. Returning Arrow 
could open the door to simpler and faster integration with other projects.

The server-side changes can be localized to the tablet service and wire 
protocol. We considered using Arrow more exhaustively throughout the server 
codebase, but found that because Arrow and Kudu's own in-memory format (i.e. 
that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the 
buffers from ColumnBlock to the scan response and build arrow::Arrays 
client-side. A POC of the server-side changes can be found here: 
https://github.com/danburkert/kudu/tree/arrow

At the time of writing this, the arrow::Array type has a varying number of 
arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one for 
data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be 
compatible with these Buffers with a couple of modifications:

* The null bitmaps in Arrow are the complement of those used by Kudu
* The RowBlock that owns the ColumnBlocks has a selection vector that needs to 
be accounted for
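
As a rough illustration of the two modifications above, here is a minimal, standard-library-only sketch. It does not use the real kudu::ColumnBlock, kudu::RowBlock, or arrow::Buffer types; the names CompactedColumn and CompactInt32Column are hypothetical. It assumes, following the first bullet, that a set bit in the source bitmap marks a null, whereas a set bit in an Arrow validity bitmap marks a present value. It also assumes all bitmaps are packed least-significant-bit first and that the column holds int32 values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for LSB-first packed bitmaps (an assumption about the
// source layout; Arrow validity bitmaps use this bit order).
inline bool GetBit(const std::vector<uint8_t>& bitmap, size_t i) {
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

inline void SetBit(std::vector<uint8_t>& bitmap, size_t i) {
  bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Hypothetical stand-in for the pair of buffers an Arrow array would wrap.
struct CompactedColumn {
  std::vector<int32_t> values;    // one slot per selected row
  std::vector<uint8_t> validity;  // bit set = value present (Arrow convention)
  size_t null_count = 0;
};

// Copies only the rows whose selection bit is set, and writes the complement
// of the source bitmap (assumed here: bit set = null) into the output.
CompactedColumn CompactInt32Column(const std::vector<int32_t>& data,
                                   const std::vector<uint8_t>& src_null_bitmap,
                                   const std::vector<uint8_t>& selection,
                                   size_t num_rows) {
  CompactedColumn out;
  for (size_t row = 0; row < num_rows; ++row) {
    if (!GetBit(selection, row)) continue;  // row was filtered out
    const size_t out_idx = out.values.size();
    if (out_idx % 8 == 0) out.validity.push_back(0);  // grow bitmap by a byte
    out.values.push_back(data[row]);  // slot content is unspecified for nulls
    if (GetBit(src_null_bitmap, row)) {
      ++out.null_count;               // leave the validity bit cleared
    } else {
      SetBit(out.validity, out_idx);  // complement: present value sets the bit
    }
  }
  return out;
}
```

Doing the compaction and the bitmap flip in a single pass keeps the output dense, so the resulting value and validity buffers would line up with each other without a separate selection vector on the client.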

If the buffers are transferred over the wire (via sidecars or protobuf), they 
should be able to be converted to Arrays via arrow::ArrayData or directly via 
the Array constructors.

  was:
Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow is 
an in-memory columnar format designed to be the common data format for a large 
number of projects (see [here](https://arrow.apache.org/)). One place we 
thought adding this would be particularly fitting is when sending results back 
to the client, since this currently returns row-wise data. By returning Arrow, 
this could open the door to simpler and faster integration with other projects.

The server-side changes can be localized to the tablet service and wire 
protocol. We considered using Arrow more exhaustively throughout the server 
codebase, but found that because Arrow and Kudu's own in-memory format (i.e. 
that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the 
buffers from ColumnBlock to the scan response and build arrow::Arrays 
client-side. A POC of the server-side changes can be found here: 
https://github.com/danburkert/kudu/tree/arrow

At the time of writing this, the arrow::Array type has a varying number of 
arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one for 
data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be 
compatible with these Buffers with a couple of modifications:

* The null-bitmaps in arrow are the complement of those used by Kudu
* The RowBlock that owns the ColumnBlocks has a selection vector needs to be 
accounted for

If the buffers are transferred over the wire (via sidecars or protobuf), they 
should be able to be converted to Arrays via arrow::ArrayData or directly via 
the Array constructors.


> Return data in Apache Arrow format
> ----------------------------------
>
>                 Key: KUDU-2077
>                 URL: https://issues.apache.org/jira/browse/KUDU-2077
>             Project: Kudu
>          Issue Type: New Feature
>          Components: client, server
>            Reporter: Andrew Wong
>
> Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow 
> is an in-memory columnar format designed to be the common data format for a 
> large number of projects (see [here|https://arrow.apache.org]). One place we 
> thought adding this would be particularly fitting is when sending results 
> back to the client, since the server currently returns row-wise data. 
> Returning Arrow could open the door to simpler and faster integration with 
> other projects.
> The server-side changes can be localized to the tablet service and wire 
> protocol. We considered using Arrow more exhaustively throughout the server 
> codebase, but found that because Arrow and Kudu's own in-memory format (i.e. 
> that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the 
> buffers from ColumnBlock to the scan response and build arrow::Arrays 
> client-side. A POC of the server-side changes can be found here: 
> https://github.com/danburkert/kudu/tree/arrow
> At the time of writing this, the arrow::Array type has a varying number of 
> arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one 
> for data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be 
> compatible with these Buffers with a couple of modifications:
> * The null bitmaps in Arrow are the complement of those used by Kudu
> * The RowBlock that owns the ColumnBlocks has a selection vector that needs 
> to be accounted for
> If the buffers are transferred over the wire (via sidecars or protobuf), they 
> should be able to be converted to Arrays via arrow::ArrayData or directly via 
> the Array constructors.


