[
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838264#comment-16838264
]
Ji Liu edited comment on ARROW-5224 at 5/13/19 4:53 AM:
--------------------------------------------------------
[~jnadeau] Thanks very much for your feedback. There are two aspects:
1. It seems hard to do some specific optimizations with the existing API. For
example, encoding int/long values is a very useful optimization that could
reduce shuffle data. This is the major motivation.
2. I am not sure whether the ArrowBuf inside a ValueVector is larger than the
data it actually holds, since allocation rounds up to the next power of 2. If
so, sending the whole buffer wastes network bandwidth.
We propose adding a utility class to implement this, making it easy to do
further optimization. It could be offered as an option that does not break
the Arrow standard format.
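To make the second point concrete, here is a minimal sketch of the arithmetic
only. It uses no Arrow classes: BufferSizeSketch and its methods are
hypothetical names, and the power-of-2 rounding is the assumption being
questioned above, not a confirmed description of how Arrow sizes or writes
its buffers.

```java
// Hypothetical helper, not part of Arrow: compares the bytes a vector's data
// needs with the size it would get IF allocation rounds up to a power of 2.
final class BufferSizeSketch {

    /** Smallest power of two that is >= n, for 1 <= n <= 2^30. */
    static int nextPowerOfTwo(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n out of range: " + n);
        }
        int highest = Integer.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    /** Padding bytes between the data actually used and the assumed allocation. */
    static int wastedBytes(int valueCount, int bytesPerValue) {
        int used = Math.multiplyExact(valueCount, bytesPerValue);
        if (used == 0) {
            return 0; // nothing stored, nothing to compare
        }
        return nextPowerOfTwo(used) - used;
    }

    public static void main(String[] args) {
        int valueCount = 1025;                  // one value past a 4 KiB boundary
        int used = valueCount * Integer.BYTES;  // bytes of real int data
        System.out.println("used=" + used
                + " allocated=" + nextPowerOfTwo(used)
                + " padding=" + wastedBytes(valueCount, Integer.BYTES));
    }
}
```

Under that assumption the worst case is a vector just past a power-of-2
boundary, where nearly half of the allocated bytes are padding.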
was (Author: tianchen92):
[~jnadeau] Thanks very much for your feedback. There are two aspects:
1. It seems hard to do some specific optimizations with the existing API. For
example, encoding int/long values is a very useful optimization that could
reduce shuffle data. This is the major motivation.
2. I am not sure whether the ArrowBuf inside a ValueVector is larger than the
data it actually holds, since allocation rounds up to the next power of 2. If
so, sending the whole buffer wastes network bandwidth.
> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -------------------------------------------------------------------------
>
> Key: ARROW-5224
> URL: https://issues.apache.org/jira/browse/ARROW-5224
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Ji Liu
> Assignee: Ji Liu
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only way
> to do this is to put a single FieldVector in a VectorSchemaRoot and convert
> it to an ArrowRecordBatch; deserialization is the same process in reverse.
> Providing a utility class for this may be better. I know all serialization
> should follow the IPC format so that data can be shared between different
> Arrow implementations. But for users who only use the Java API and want to
> do some further optimization, this seems to be no problem, and we could
> give them one more option.
> This may benefit Java users who only use ValueVector rather than the IPC
> classes such as ArrowRecordBatch:
> * We could do shuffle optimizations such as compression and encoding
> algorithms for numerical types, which could greatly improve performance.
> * Serialize/deserialize with the actual buffer size within the vector, since
> the allocated buffer size is a power of 2, which can be larger than needed.
> * Reduce data conversion (VectorSchemaRoot, ArrowRecordBatch, etc.) to make
> it more user-friendly.
>
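As an illustration of the kind of numerical encoding mentioned in the first
bullet, the sketch below applies zigzag plus variable-length (varint) encoding
to an int array. This is one possible technique chosen for the example, not
the encoding proposed in the issue or pull request; VarIntSketch and its
methods are hypothetical names, and the code works on a plain int[] instead
of an Arrow vector.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical example, not part of Arrow: small-magnitude ints take 1-2
// bytes instead of a fixed 4, which is where the shuffle saving would come from.
final class VarIntSketch {

    /** Zigzag-maps each value to an unsigned form, then writes 7 bits per byte. */
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            int z = (v << 1) ^ (v >> 31);          // zigzag: -1 -> 1, 1 -> 2, ...
            while ((z & ~0x7F) != 0) {
                out.write((z & 0x7F) | 0x80);      // high bit set: more bytes follow
                z >>>= 7;
            }
            out.write(z);                          // final byte, high bit clear
        }
        return out.toByteArray();
    }

    /** Reverses encode(); count is the number of values to read back. */
    static int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int z = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                z |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            values[i] = (z >>> 1) ^ -(z & 1);      // undo zigzag
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, -1, 300, 100000};
        byte[] encoded = encode(values);
        System.out.println("raw bytes=" + values.length * Integer.BYTES
                + " encoded bytes=" + encoded.length);
    }
}
```

The saving depends entirely on the data: values near zero shrink, while
values that need all 32 bits grow to 5 bytes each, so an encoding like this
would have to stay optional.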
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)