[ 
https://issues.apache.org/jira/browse/ARROW-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877410#comment-16877410
 ] 

Ji Liu edited comment on ARROW-5821 at 7/3/19 7:41 AM:
-------------------------------------------------------

Thanks a lot for your feedback. [~jnadeau] [~wesmckinn]

More precisely, I suggest providing a utility class in the arrow-algorithm 
module instead of changing the IPC format. For a given fixed-width vector with 
many null values (e.g. valueCount=1000, nullCount=990), the utility would move 
the non-null values to the front so that valueCount=10, and create a BitVector 
that records the null value indices. Conversely, given a compacted vector and 
its BitVector, it could recover the original data layout (e.g. 
valueCount=1000, nullCount=990).
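
To make the compact/recover round trip concrete, here is a minimal sketch in 
plain Java. It does not use the Arrow API: an Integer[] with null entries 
stands in for the fixed-width vector, java.util.BitSet stands in for the 
BitVector, and the names CompactSketch, Compacted, compact and expand are 
hypothetical. It also assumes that a set bit marks a non-null slot.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical helper, not part of Arrow; plain arrays stand in for vectors. */
public final class CompactSketch {

    /** Compacted form: non-null values packed in order, plus a validity bitmap. */
    public static final class Compacted {
        public final int[] values;     // only the non-null values, in original order
        public final BitSet validity;  // bit i set => original slot i was non-null (assumption)
        public final int valueCount;   // number of slots in the original vector

        Compacted(int[] values, BitSet validity, int valueCount) {
            this.values = values;
            this.validity = validity;
            this.valueCount = valueCount;
        }
    }

    /** Moves the non-null values to the front and records where they came from. */
    public static Compacted compact(Integer[] original) {
        BitSet validity = new BitSet(original.length);
        int[] packed = new int[original.length];
        int n = 0;
        for (int i = 0; i < original.length; i++) {
            if (original[i] != null) {
                validity.set(i);
                packed[n++] = original[i];
            }
        }
        return new Compacted(Arrays.copyOf(packed, n), validity, original.length);
    }

    /** Rebuilds the original layout from the packed values and the bitmap. */
    public static Integer[] expand(Compacted c) {
        Integer[] out = new Integer[c.valueCount];
        int next = 0;
        // Walk the set bits in ascending order; each one receives the next packed value.
        for (int i = c.validity.nextSetBit(0); i >= 0; i = c.validity.nextSetBit(i + 1)) {
            out[i] = c.values[next++];
        }
        return out;
    }

    public static void main(String[] args) {
        Integer[] original = new Integer[1000];   // 1000 slots, all null
        for (int i = 0; i < 10; i++) {
            original[i * 100] = i;                // 10 non-null values
        }
        Compacted c = compact(original);
        System.out.println(c.values.length);                      // 10
        System.out.println(Arrays.equals(original, expand(c)));   // true
    }
}
```

With 4-byte values, this example shrinks the value buffer from 4000 bytes to 40 
bytes, while the bitmap still needs one bit per original slot (125 bytes).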

In some cases, compacting with this kind of utility before the shuffle and 
expanding afterwards will greatly reduce the data size. Moreover, control stays 
in the hands of users, and we do not need to worry about the IPC format since 
we would no longer change it.

Thanks!


was (Author: tianchen92):
Thanks a lot for your feedback. [~jnadeau] [~wesmckinn]

More precisely, I suggest providing a utility class in the arrow-algorithm 
module instead of changing the IPC format. For a given fixed-width vector with 
many null values (e.g. valueCount=1000, nullCount=990), the utility would 
create a new fixed-width vector with valueCount=10 and a BitVector that 
records the null value indices. Conversely, given a compacted vector and its 
BitVector, it could recover the original data layout (e.g. 
valueCount=1000, nullCount=990).

In some cases, compacting with this kind of utility before the shuffle and 
expanding afterwards will greatly reduce the data size. Moreover, control stays 
in the hands of users, and we do not need to worry about the IPC format since 
we would no longer change it.

Thanks!

> [Java] Support compact fixed-width vectors
> ------------------------------------------
>
>                 Key: ARROW-5821
>                 URL: https://issues.apache.org/jira/browse/ARROW-5821
>             Project: Apache Arrow
>          Issue Type: New Feature
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>
> In the shuffle stage of some applications, FixedWidthVectors may hold very 
> little non-null data.
> In this case, serializing the vectors directly is not a good choice. Instead, 
> we can compact the vector so that it holds only the non-null values and 
> create a BitVector to track the indices of the non-null values so that the 
> vector can be deserialized properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
