[
https://issues.apache.org/jira/browse/ARROW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834478#comment-16834478
]
Ji Liu edited comment on ARROW-5207 at 5/8/19 3:23 AM:
-------------------------------------------------------
[~jnadeau] Thanks a lot for your comments, and I'm sorry the title misled you.
You are right: what I mean is reusing the same ArrowBuf within a Vector in some
cases. Maybe the title should be 'enable resize buffers within Vector for given
valueCount and dataLength'.
The idea is inspired by the following case:
In general, the schema of the data is fixed during the shuffle stage, so we
could keep several reusable vectors in a serializer for deserialization rather
than creating new vectors for every deserialization. The serialization process
has written the vectors (mainly their buffers) to a stream (such as a
WritableChannel, not using ArrowRecordBatch), so when deserializing we read the
buffer metadata (valueCount and dataLength) and the buffers from the channel
and write them directly into the reused vector. In this case, if the dataBuffer
length that was read is greater than dataBuffer.capacity of the reused vector,
or the valueCount/8 that was read is greater than validityBuffer.capacity of
the reused vector, we need to resize these buffers to hold the data. Otherwise,
if the data fits within the reused vector's capacity, we just write it into the
existing buffers without resizing them.
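The capacity check described above could be sketched roughly as follows. This
is only an illustration: it uses java.nio.ByteBuffer as a stand-in for
ArrowBuf, and the class ReusableBuffers and all of its methods are hypothetical
names, not part of the Arrow API. It leaves out the offset buffer, and it
rounds the validity size up to (valueCount + 7) / 8 bytes, on the assumption
that one bit is needed per value.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a reused vector: one validity buffer, one data buffer. */
public class ReusableBuffers {
    private ByteBuffer validity;
    private ByteBuffer data;
    private int reallocCount = 0;

    public ReusableBuffers(int validityCapacity, int dataCapacity) {
        this.validity = ByteBuffer.allocate(validityCapacity);
        this.data = ByteBuffer.allocate(dataCapacity);
    }

    /** Bytes needed to hold one validity bit per value, rounded up. */
    public static int validityBytes(int valueCount) {
        return (valueCount + 7) / 8;
    }

    /** Grow a buffer only when the metadata read from the stream exceeds its capacity. */
    public void ensureCapacity(int valueCount, int dataLength) {
        int neededValidity = validityBytes(valueCount);
        if (neededValidity > validity.capacity()) {
            validity = ByteBuffer.allocate(neededValidity);
            reallocCount++;
        }
        if (dataLength > data.capacity()) {
            data = ByteBuffer.allocate(dataLength);
            reallocCount++;
        }
        // Reset positions so the next read from the channel starts at offset 0.
        validity.clear();
        data.clear();
    }

    public int validityCapacity() { return validity.capacity(); }
    public int dataCapacity() { return data.capacity(); }
    public int reallocCount() { return reallocCount; }
}
```

A deserializer would call ensureCapacity(valueCount, dataLength) after reading
the metadata and before copying the buffer contents from the channel.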
My proposal is to add some APIs that make resizing buffers simpler. (Currently,
to do this with BaseVariableWidthVector, it seems we need to call
_setInitialCapacity_ and _allocateNew_, and the calculated size may not be
accurate enough because of density.) Do you have any better ideas or
suggestions for this? Thanks very much.
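To illustrate the density concern: a data buffer size derived from valueCount
times an assumed average number of bytes per value will generally differ from
the exact dataLength read from the stream. The sketch below is plain
arithmetic, not Arrow code; the class DensityEstimate, the method
estimateDataBytes, and the numbers used are all hypothetical.

```java
public class DensityEstimate {
    /** Estimated data buffer size from a value count and an assumed average bytes per value. */
    public static long estimateDataBytes(int valueCount, double density) {
        return (long) Math.ceil(valueCount * density);
    }

    public static void main(String[] args) {
        int valueCount = 1000;
        long actualDataLength = 12_345; // exact length read from the stream (made-up number)
        double assumedDensity = 8.0;    // assumed average bytes per value (made-up number)

        long estimated = estimateDataBytes(valueCount, assumedDensity);
        // With these numbers the estimate is smaller than the actual length,
        // so a buffer sized from it alone could not hold the data.
        System.out.println("estimated=" + estimated + " actual=" + actualDataLength);
    }
}
```

Sizing the buffer directly from the dataLength that was read avoids this
mismatch.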
was (Author: tianchen92):
[~jnadeau] Thanks a lot for your comments, and I'm sorry the title misled you.
You are right: what I mean is reusing the same ArrowBuf within a Vector in some
cases. The idea is inspired by the following case:
In general, the schema of the data is fixed during the shuffle stage, so we
could keep a reusable vector in a serializer for serialize/deserialize work.
Given the serialized valueCount and dataBuffer length, we would like to
deserialize buffers directly into the reused vector. There’s no problem if
dataBuffer.capacity >= serialized dataBuffer length && valueCount within
reuseVector >= serialized valueCount; otherwise we have to resize the buffers
within the reused vector, which is not very easy. (For example, resizing
buffers in BaseVariableWidthVector requires calling _setInitialCapacity_ and
_allocateNew_, and the resulting buffer size is not very accurate because of
density.)
What I actually want to do is make this process of resizing buffers simpler. If
you think this is not formal enough, maybe we could move it to a utility class.
Or do you have suggestions for how to do this? Thanks very much.
> [Java] add APIs to support vector reuse
> ---------------------------------------
>
> Key: ARROW-5207
> URL: https://issues.apache.org/jira/browse/ARROW-5207
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Ji Liu
> Assignee: Ji Liu
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> In some scenarios we would like a ValueVector to be reused to reduce creation
> overhead. This is very common in the shuffle stage: there is no need to
> create a ValueVector or realloc buffers every time. Suppose the recordCount
> of the ValueVector and the capacity of its buffers are written to the stream;
> when we deserialize it, we can simply judge from dataLength whether a realloc
> is needed.
> My proposal is to add APIs to ValueVector to handle this logic; otherwise
> users who want reuse have to implement it themselves, which is not
> user-friendly.
> If you agree with this, I would like to take this ticket. Thanks