[ 
https://issues.apache.org/jira/browse/KUDU-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551774#comment-16551774
 ] 

Andrew Wong commented on KUDU-2077:
-----------------------------------

[~jinxing6...@126.com]

Yeah, I agree. Any sort of cross-component interaction with Kudu right now 
converts from Kudu's custom return format and it'd be nice if this were more 
standardized. I particularly think this would be interesting for implementing 
the ColumnarBatch API for Spark's 
[DataSourceV2|https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/sources/v2/DataSourceV2.html].

That said, I don't think anyone is working on this at the moment.

 

> Return data in Apache Arrow format
> ----------------------------------
>
>                 Key: KUDU-2077
>                 URL: https://issues.apache.org/jira/browse/KUDU-2077
>             Project: Kudu
>          Issue Type: New Feature
>          Components: client, server
>            Reporter: Andrew Wong
>            Priority: Major
>
> Dan and I spent the hackathon tinkering with the Apache Arrow format. Arrow 
> is an in-memory columnar format designed to be the common data format for a 
> large number of projects, see [here|https://arrow.apache.org]. One place we 
> thought adding this would be particularly fitting is when sending results 
> back to the client, since this currently returns row-wise data. By returning 
> Arrow, this could open the door to simpler and faster integration with other 
> projects.
> The server-side changes can be localized to the tablet service and wire 
> protocol. We considered using Arrow more exhaustively throughout the server 
> codebase, but found that because Arrow and Kudu's own in-memory format (i.e. 
> that in kudu::ColumnBlock) are so similar, a simpler approach is to copy the 
> buffers from ColumnBlock to the scan response and build arrow::Arrays 
> client-side. A POC of the server-side changes can be found here: 
> https://github.com/danburkert/kudu/tree/arrow
> At the time of writing this, the arrow::Array type has a varying number of 
> arrow::Buffers, depending on the data type (e.g. one for null bitmaps, one 
> for data, etc). The ColumnBlock "buffers" (i.e. data, null_bitmap) should be 
> compatible with these Buffers with a couple of modifications:
> * The null-bitmaps in arrow are the complement of those used by Kudu
> * The RowBlock that owns the ColumnBlocks has a selection vector needs to be 
> accounted for
> If the buffers are transferred over the wire (via sidecars or protobuf), they 
> should be able to be converted to Arrays via arrow::ArrayData or directly via 
> the Array constructors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to