[ 
https://issues.apache.org/jira/browse/ARROW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834478#comment-16834478
 ] 

Ji Liu commented on ARROW-5207:
-------------------------------

[~jnadeau] Thanks a lot for your comments, and I'm sorry the title was 
misleading. You are right: what I mean is reusing the same ArrowBuf within a 
Vector in some cases. The idea is inspired by the following case:

In general, the schema of the data is fixed during the shuffle stage, so we 
could keep a reusable vector in a serializer to do the serialize/deserialize 
work. Given the serialized valueCount and dataBuffer length, we would like to 
deserialize buffers directly into the reuse vector. There’s no problem if 
dataBuffer.capacity >= serialized dataBuffer length && valueCount within 
reuseVector >= serialized valueCount. Otherwise we have to resize the buffers 
within the reuse vector, which is not very easy. (For example, resizing buffers 
in BaseVariableWidthVector requires calling _setInitialCapacity_ and 
_allocateNew_, and the resulting buffer size is not very accurate because of 
density.)
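
To make the condition above concrete, here is a minimal sketch in plain Java. It 
does not use the Arrow API: the class name ReuseCheck, the method canReuse and 
all parameter names are hypothetical and only illustrate the check, under the 
assumption that the serialized data length and valueCount are read from the 
stream before the buffers.

```java
// Hypothetical helper, not part of the Arrow API. It only models the
// "can we deserialize into the existing buffers?" decision described above.
final class ReuseCheck {
    private ReuseCheck() {}

    /**
     * Returns true when the reuse vector can hold the serialized batch as-is,
     * i.e. both the data buffer and the value capacity are large enough.
     * Returns false when the buffers would have to be resized first.
     */
    static boolean canReuse(long dataBufferCapacity, int vectorValueCapacity,
                            long serializedDataLength, int serializedValueCount) {
        return dataBufferCapacity >= serializedDataLength
            && vectorValueCapacity >= serializedValueCount;
    }

    public static void main(String[] args) {
        // Buffers are large enough: deserialize directly into the reuse vector.
        System.out.println(canReuse(1024, 100, 512, 50));
        // Data buffer is too small: a resize is needed before deserializing.
        System.out.println(canReuse(256, 100, 512, 50));
    }
}
```

In a real serializer the first two arguments would come from the reuse vector 
and the last two from the stream header; how to perform the resize when the 
check fails is the open question of this ticket.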

What I actually want is to make this process of resizing buffers simpler. If 
you think this does not belong in the formal API, maybe we could move it to a 
utility class. Or do you have suggestions on how to do this? Thanks very much.

> [Java] add APIs to support vector reuse
> ---------------------------------------
>
>                 Key: ARROW-5207
>                 URL: https://issues.apache.org/jira/browse/ARROW-5207
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In some scenarios we hope that a ValueVector can be reused to reduce creation 
> overhead. This is very common in the shuffle stage: there is no need to 
> create a ValueVector or realloc buffers every time. Suppose that the 
> recordCount of the ValueVector and the capacity of its buffers are written to 
> the stream; when we deserialize it, we can simply judge from dataLength 
> whether a realloc is needed.
> My proposal is to add APIs to ValueVector to handle this logic; otherwise 
> users who want to reuse vectors have to implement it themselves, which is not 
> user-friendly. 
> If you agree with this, I would like to take this ticket. Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
