[ https://issues.apache.org/jira/browse/ARROW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834478#comment-16834478 ]
Ji Liu commented on ARROW-5207:
-------------------------------
[~jnadeau] Thanks a lot for your comments, and sorry that the title was
misleading. You are right: what I mean is reusing the same ArrowBuf within a
vector in some cases. The idea is inspired by the following case:
In general, the schema of the data is fixed during the shuffle stage, so we can
keep a reusable vector in a serializer to do the serialize/deserialize work.
Given the serialized valueCount and dataBuffer length, we would like to
deserialize the buffers directly into the reusable vector. There is no problem
if dataBuffer.capacity >= serialized dataBuffer length && valueCount within
reuseVector >= serialized valueCount. Otherwise we have to resize the buffers
within the reusable vector, which is not very easy. For example, resizing the
buffers in BaseVariableWidthVector requires calling _setInitialCapacity_ and
_allocateNew_, and the resulting buffer size is not very accurate because it
is estimated from a density.
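The reuse check described above can be sketched roughly as follows. This is an
illustration only and does not use the Arrow API: the class
ReusableVectorSketch and its methods are hypothetical, and a plain
java.nio.ByteBuffer stands in for the data ArrowBuf. It assumes the serialized
data length and valueCount are known before the batch is loaded.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical stand-in for a reusable fixed-schema vector. NOT the Arrow API:
 * a plain ByteBuffer plays the role of the vector's data buffer.
 */
final class ReusableVectorSketch {
    private ByteBuffer dataBuffer;  // models the data buffer of the vector
    private int valueCapacity;      // number of values the vector can hold
    private int valueCount;         // number of values currently loaded
    private int reallocations;      // how many times the buffer was replaced

    ReusableVectorSketch(int initialDataBytes, int initialValueCapacity) {
        this.dataBuffer = ByteBuffer.allocate(initialDataBytes);
        this.valueCapacity = initialValueCapacity;
    }

    /** True when the existing buffer can hold the serialized batch as-is. */
    boolean canReuse(int serializedDataLength, int serializedValueCount) {
        return dataBuffer.capacity() >= serializedDataLength
                && valueCapacity >= serializedValueCount;
    }

    /** Loads one serialized batch, reallocating only when the reuse check fails. */
    void load(byte[] serializedData, int serializedValueCount) {
        if (!canReuse(serializedData.length, serializedValueCount)) {
            // Size the new buffer from the exact serialized length,
            // rather than from a density-based estimate.
            dataBuffer = ByteBuffer.allocate(
                    Math.max(dataBuffer.capacity(), serializedData.length));
            valueCapacity = Math.max(valueCapacity, serializedValueCount);
            reallocations++;
        }
        dataBuffer.clear();            // reset position and limit for reuse
        dataBuffer.put(serializedData);
        dataBuffer.flip();             // make the loaded bytes readable
        valueCount = serializedValueCount;
    }

    int dataCapacity() { return dataBuffer.capacity(); }
    int valueCount() { return valueCount; }
    int reallocations() { return reallocations; }
}
```

In this sketch, a serializer would keep one ReusableVectorSketch per column and
call load for every incoming batch. Only a batch larger than any seen so far
triggers a reallocation.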
What I actually want to do is make this process of resizing buffers simpler. If
you think this is not suitable as a formal API, maybe we could move it to a
utility class. Or do you have other suggestions on how to do this? Thanks very
much.
> [Java] add APIs to support vector reuse
> ---------------------------------------
>
> Key: ARROW-5207
> URL: https://issues.apache.org/jira/browse/ARROW-5207
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Ji Liu
> Assignee: Ji Liu
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> In some scenarios we would like to reuse a ValueVector to reduce creation
> overhead. This is very common in the shuffle stage: there is no need to create
> a ValueVector or realloc its buffers every time. Suppose the recordCount of
> the ValueVector and the capacity of its buffers are written to the stream;
> when we deserialize it, we can simply determine whether a realloc is needed
> from dataLength.
> My proposal is to add APIs to ValueVector to handle this logic; otherwise
> users have to implement it themselves if they want to reuse vectors, which is
> not user-friendly.
> If you agree with this, I would like to take this ticket. Thanks.