[jira] [Comment Edited] (ARROW-5207) [Java] add APIs to support vector reuse

Ji Liu (JIRA) Tue, 07 May 2019 20:26:35 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834478#comment-16834478
 ]


Ji Liu edited comment on ARROW-5207 at 5/8/19 3:23 AM:
-------------------------------------------------------

[~jnadeau] Thanks a lot for your comments, and sorry that the title was 
misleading. You are right: what I mean is reusing the same ArrowBuf within a 
Vector in some cases. Maybe the title should be 'enable resizing buffers within 
a Vector for a given valueCount and dataLength'.

The idea is inspired by the following case:

In general, the schema of the data is fixed during the shuffle stage, so we 
could keep several reusable vectors in a serializer for deserialization rather 
than creating new vectors every time. Since serialization writes the vectors 
(mainly their buffers) to a stream (such as a WritableChannel, not using 
ArrowRecordBatch), deserialization reads the buffer metadata (valueCount and 
dataLength) and the buffers from the channel and writes them directly into the 
reusable vector. In this case, if the dataBuffer length read is greater than 
the dataBuffer.capacity of the reusable vector, or the valueCount/8 read is 
greater than its validityBuffer.capacity, we need to resize these buffers to 
hold the data. Otherwise, if the data fits within the reusable vector's 
capacity, we just write it into the buffers without resizing.
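
The check described above can be sketched as follows. This is only an 
illustration under stated assumptions: it uses plain java.nio.ByteBuffer 
instead of ArrowBuf, the class ReusableBuffers and all of its methods are 
hypothetical names (not existing or proposed Arrow APIs), and the offset buffer 
of a variable-width vector is left out for brevity.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a reusable vector; NOT part of Arrow's API. */
public class ReusableBuffers {
    private ByteBuffer validity;
    private ByteBuffer data;
    private int valueCount;

    public ReusableBuffers(int initialValueCount, int initialDataLength) {
        validity = ByteBuffer.allocate(validityBytes(initialValueCount));
        data = ByteBuffer.allocate(initialDataLength);
    }

    /** One validity bit per value, rounded up to whole bytes. */
    static int validityBytes(int valueCount) {
        return (valueCount + 7) / 8;
    }

    /** Grows a buffer only if it is too small; returns true if anything grew. */
    public boolean ensureCapacity(int newValueCount, int newDataLength) {
        boolean resized = false;
        if (validityBytes(newValueCount) > validity.capacity()) {
            validity = ByteBuffer.allocate(validityBytes(newValueCount));
            resized = true;
        }
        if (newDataLength > data.capacity()) {
            data = ByteBuffer.allocate(newDataLength);
            resized = true;
        }
        return resized;
    }

    /** Writes deserialized bytes into the existing buffers, resizing if needed. */
    public void load(int newValueCount, byte[] validityIn, byte[] dataIn) {
        ensureCapacity(newValueCount, dataIn.length);
        validity.clear();
        validity.put(validityIn, 0, validityBytes(newValueCount));
        data.clear();
        data.put(dataIn);
        valueCount = newValueCount;
    }

    public int validityCapacity() { return validity.capacity(); }
    public int dataCapacity() { return data.capacity(); }
    public int valueCount() { return valueCount; }
}
```

With this shape, a deserializer would read valueCount and dataLength from the 
channel first and then call load (or ensureCapacity followed by a direct read 
into the buffers), so a new allocation happens only when the incoming batch is 
larger than anything the vector has held before.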

My proposal is to add some APIs that make resizing buffers simpler. (Currently, 
to do this with BaseVariableWidthVector, it seems we need to call 
_setInitialCapacity_ and _allocateNew_, and the calculated size may not be 
accurate enough because of density.) Do you have any better ideas or 
suggestions for this? Thanks very much.


was (Author: tianchen92):
[~jnadeau] Thanks a lot for your comments, and sorry that the title was 
misleading. You are right: what I mean is reusing the same ArrowBuf within a 
Vector in some cases. The idea is inspired by the following case:

In general, the schema of the data is fixed during the shuffle stage, so we 
could keep a reusable vector in a serializer for serialization and 
deserialization. Given the serialized valueCount and dataBuffer length, we 
would like to deserialize buffers directly into the reusable vector. There is 
no problem if dataBuffer.capacity >= serialized dataBuffer length && valueCount 
within reuseVector >= serialized valueCount. Otherwise we have to resize the 
buffers within the reusable vector, which is not very easy. (For example, 
resizing buffers in BaseVariableWidthVector requires calling 
_setInitialCapacity_ and _allocateNew_, and the resulting buffer size is not 
very accurate because of density.)

What I actually want is to make this process of resizing buffers simpler. If 
you think this is not suitable as a formal API, maybe we could move it to a 
utility class. Or do you have other suggestions for how to do this? Thanks very 
much.

> [Java] add APIs to support vector reuse
> ---------------------------------------
>
>                 Key: ARROW-5207
>                 URL: https://issues.apache.org/jira/browse/ARROW-5207
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In some scenarios we would like a ValueVector to be reusable to reduce 
> creation overhead. This is very common in the shuffle stage, where there is 
> no need to create a ValueVector or realloc its buffers every time. Suppose 
> the recordCount of the ValueVector and the capacity of its buffers are 
> written to the stream; when we deserialize it, we can simply decide whether a 
> realloc is needed based on dataLength.
> My proposal is to add APIs to ValueVector to handle this logic; otherwise 
> users have to implement it themselves if they want reuse, which is not 
> user-friendly. 
> If you agree with this, I would like to take this ticket. Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
