[ 
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839050#comment-16839050
 ] 

Micah Kornfield commented on ARROW-5224:
----------------------------------------

[~tianchen92] my main concern with this change is that it shouldn't be a 
one-off for Java.  If these kinds of on-the-wire encodings are useful, 
we should come up with a supportable way to make them work across language 
implementations.  I think this is important to discuss on the mailing list 
directly (many people filter out JIRA/pull request notifications).  Real 
performance numbers/benchmarks would be helpful in making the case to support 
this.  I'm also curious whether you measured black-box compression with 
something like snappy (the link I provided above) to see if the encoding still 
provides a benefit after compression is applied to the entire vector.
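
A rough way to run that comparison is sketched below.  It is only a sketch: 
java.util.zip.Deflater stands in for snappy (which is not in the JDK), the 
unsigned LEB128 varint is an assumed encoding and not necessarily the one in 
the pull request, and the class and method names are hypothetical.  It prints 
sizes for fixed-width and varint data before and after compression; the numbers 
depend entirely on the data, so real vectors would be needed to draw any 
conclusion.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.Deflater;

// Hypothetical helper, not part of Arrow.
final class CompressionCheckSketch {
    /** Fixed-width layout: 8 little-endian bytes per value, like a BigIntVector data buffer. */
    public static byte[] fixedWidth(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(8 * values.length).order(ByteOrder.LITTLE_ENDIAN);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    /** Assumed encoding: unsigned LEB128 varints, 7 payload bits per byte. */
    public static byte[] varint(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80)); // continuation bit set
                v >>>= 7;
            }
            out.write((int) v); // last byte, continuation bit clear
        }
        return out.toByteArray();
    }

    /** Size of the input after general-purpose (DEFLATE) compression. */
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Synthetic data for illustration only; use real vectors for real numbers.
        long[] values = new long[10_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i % 1000;
        }
        byte[] fixed = fixedWidth(values);
        byte[] encoded = varint(values);
        System.out.println("fixed-width, raw:        " + fixed.length);
        System.out.println("fixed-width, compressed: " + deflatedSize(fixed));
        System.out.println("varint, raw:             " + encoded.length);
        System.out.println("varint, compressed:      " + deflatedSize(encoded));
    }
}
```

If the two compressed sizes end up close, the custom encoding adds little on 
top of whole-vector compression.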

If we are going to make encodings supportable we should either extend 
Schema.fbs or use the custom metadata that is already built into the schema 
(https://github.com/apache/arrow/blob/master/format/Schema.fbs#L265) so 
encodings can be communicated across clients.  Again, since the convention/design 
needs to be agreed upon, discussing it on the mailing list is important.

I think a utility class to convert between BigIntVector and an encoded 
VarBinaryVector could also be a potentially valuable contribution, but for this 
use case I think you lose a lot of the value of the encoding (you have a 4-byte 
overhead per encoded entry to keep track of the offsets).
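
To make the offset overhead concrete, here is a back-of-the-envelope sketch.  
It assumes the encoding is an unsigned LEB128 varint (an assumption; the pull 
request may use something else), ignores validity bitmaps and buffer padding, 
and uses hypothetical class and method names.  A VarBinaryVector keeps one 
4-byte offset per entry plus one extra, so the encoded form only wins when the 
values are small.

```java
// Hypothetical helper, not part of Arrow.
final class EncodedSizeSketch {
    /** Bytes needed for an unsigned LEB128 varint of v (7 payload bits per byte). */
    public static int varintSize(long v) {
        int n = 1;
        while ((v & ~0x7FL) != 0) {
            v >>>= 7;
            n++;
        }
        return n;
    }

    /** Data buffer size of a BigIntVector: 8 bytes per value. */
    public static long bigIntBytes(long[] values) {
        return 8L * values.length;
    }

    /** Data plus offset buffer size of a VarBinaryVector holding one varint per entry. */
    public static long varBinaryBytes(long[] values) {
        long data = 0;
        for (long v : values) {
            data += varintSize(v);
        }
        long offsets = 4L * (values.length + 1); // int32 offsets, n + 1 of them
        return data + offsets;
    }

    public static void main(String[] args) {
        long[] small = {1, 2, 3, 100, 127};
        System.out.println(bigIntBytes(small));    // 40
        System.out.println(varBinaryBytes(small)); // 29 (5 data + 24 offsets)

        long[] medium = {1L << 21, 1L << 21, 1L << 21, 1L << 21, 1L << 21};
        System.out.println(bigIntBytes(medium));    // 40
        System.out.println(varBinaryBytes(medium)); // 44 (20 data + 24 offsets)
    }
}
```

Under these assumptions the break-even point is roughly 4 encoded bytes per 
value: anything larger costs more than the plain 8-byte layout.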



> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -------------------------------------------------------------------------
>
>                 Key: ARROW-5224
>                 URL: https://issues.apache.org/jira/browse/ARROW-5224
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only way 
> to do this is to put a single FieldVector in a VectorSchemaRoot and 
> convert it to an ArrowRecordBatch; deserialization is the same process in 
> reverse. Providing a utility class for this may be better. I know all 
> serialization should follow the IPC format so that data can be shared between 
> different Arrow implementations, but for users who only use the Java API and 
> want to do some further optimization, this should not be a problem, and we 
> could give them one more option.
> This may benefit Java users who only use ValueVector rather 
> than IPC classes such as ArrowRecordBatch:
>  * We could apply shuffle optimizations such as compression and encoding 
> algorithms for numerical types, which could greatly improve performance.
>  * Serialize/deserialize using the actual data size within the vector, since 
> the buffer size is a power of 2 and is larger than what is really needed.
>  * Reduce data conversion (VectorSchemaRoot, ArrowRecordBatch, etc.) to make 
> it more user-friendly.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
