[ https://issues.apache.org/jira/browse/KUDU-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551774#comment-16551774 ]
Andrew Wong commented on KUDU-2077:
-----------------------------------

[~jinxing6...@126.com] Yeah, I agree. Any sort of cross-component interaction with Kudu right now has to convert from Kudu's custom return format, and it'd be nice if this were more standardized. I particularly think this would be interesting for implementing the ColumnarBatch API for Spark's [DataSourceV2|https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/sources/v2/DataSourceV2.html].

That said, I don't think anyone is working on this at the moment.

> Return data in Apache Arrow format
> ----------------------------------
>
>                 Key: KUDU-2077
>                 URL: https://issues.apache.org/jira/browse/KUDU-2077
>             Project: Kudu
>          Issue Type: New Feature
>          Components: client, server
>            Reporter: Andrew Wong
>            Priority: Major
>
> Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow
> is an in-memory columnar format designed to be the common data format for a
> large number of projects, see [here|https://arrow.apache.org]. One place we
> thought adding this would be particularly fitting is when sending results
> back to the client, since this currently returns row-wise data. By returning
> Arrow, this could open the door to simpler and faster integration with other
> projects.
>
> The server-side changes can be localized to the tablet service and wire
> protocol. We considered using Arrow more exhaustively throughout the server
> codebase, but found that because Arrow and Kudu's own in-memory format (i.e.
> that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the
> buffers from ColumnBlock to the scan response and build arrow::Arrays
> client-side. A POC of the server-side changes can be found here:
> https://github.com/danburkert/kudu/tree/arrow
>
> At the time of writing this, the arrow::Array type has a varying number of
> arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one
> for data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be
> compatible with these Buffers with a couple of modifications:
> * The null bitmaps in Arrow are the complement of those used by Kudu
> * The RowBlock that owns the ColumnBlocks has a selection vector that needs
>   to be accounted for
>
> If the buffers are transferred over the wire (via sidecars or protobuf), they
> should be able to be converted to Arrays via arrow::ArrayData or directly via
> the Array constructors.
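
The two modifications listed in the description (complementing the null bitmap and accounting for the selection vector) could look roughly like the sketch below. It is an illustration only and uses neither the Kudu nor the Arrow libraries. It assumes a fixed-width int32 column, LSB-ordered bitmaps, and, following the description, that a set bit on the Kudu side marks a null while a set bit in the Arrow validity bitmap marks a valid value. The names CompactedColumn and ToArrowLayout are hypothetical and do not appear in the POC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read bit `i` of an LSB-ordered bitmap (bit i lives in byte i / 8).
inline bool GetBit(const std::vector<uint8_t>& bitmap, size_t i) {
  return (bitmap[i / 8] >> (i % 8)) & 1;
}

// Set bit `i` of an LSB-ordered bitmap.
inline void SetBit(std::vector<uint8_t>& bitmap, size_t i) {
  bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Hypothetical holder for the two buffers an Arrow int32 array would need.
struct CompactedColumn {
  std::vector<int32_t> values;    // one slot per selected row
  std::vector<uint8_t> validity;  // Arrow convention: 1 = valid
  size_t length = 0;
  size_t null_count = 0;
};

// Assumption: in `null_bitmap` a set bit means "null" (as the description
// implies), and in `selection` a set bit means "row is selected".
CompactedColumn ToArrowLayout(const std::vector<int32_t>& data,
                              const std::vector<uint8_t>& null_bitmap,
                              const std::vector<uint8_t>& selection) {
  const size_t num_rows = data.size();
  const size_t bitmap_bytes = (num_rows + 7) / 8;
  assert(null_bitmap.size() >= bitmap_bytes);
  assert(selection.size() >= bitmap_bytes);

  CompactedColumn out;
  out.validity.assign(bitmap_bytes, 0);  // upper bound, trimmed below
  for (size_t row = 0; row < num_rows; ++row) {
    if (!GetBit(selection, row)) continue;  // drop unselected rows
    const bool is_null = GetBit(null_bitmap, row);
    out.values.push_back(is_null ? 0 : data[row]);
    if (is_null) {
      ++out.null_count;
    } else {
      SetBit(out.validity, out.length);  // complemented bit
    }
    ++out.length;
  }
  out.validity.resize((out.length + 7) / 8);
  return out;
}
```

This version compacts on the receiving side, so the unselected rows would still cross the wire. Applying the selection vector on the server before copying the buffers would avoid that, at the cost of an extra pass per column.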