Re: [Java] Append multiple record batches together?

Bryan Cutler Tue, 12 Nov 2019 11:29:31 -0800

Yes, you are correct. I think I was mixing up a couple different things. I
like the way C++/Python distinguishes it where a RecordBatch is contiguous
memory and a Table can be chunked. So since you are just talking about
RecordBatches, I think we should keep it contiguous and concat would
require memcpy. Maybe Java can add the concept of Tables and ChunkedArrays
sometime in the future.
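
To make the distinction concrete, here is a minimal sketch in plain Java. It deliberately uses no Arrow classes: `ChunkedIntArray` and `concat` are hypothetical names for illustration only, with `int[]` standing in for a vector's data buffer. It is not how Arrow implements either concept, just the trade-off in miniature: chunking keeps references to the original buffers and copies nothing, while concat produces one contiguous buffer at the cost of a copy.

```java
import java.util.ArrayList;
import java.util.List;

public class ConcatVsChunked {

    /** Logical (lazy) concatenation: holds references to the chunks, copies nothing. */
    static final class ChunkedIntArray {
        private final List<int[]> chunks = new ArrayList<>();
        private int length = 0;

        void addChunk(int[] chunk) {
            chunks.add(chunk);
            length += chunk.length;
        }

        int length() {
            return length;
        }

        /** Walks the chunk list to find the chunk that holds the logical index. */
        int get(int index) {
            if (index < 0 || index >= length) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            for (int[] chunk : chunks) {
                if (index < chunk.length) {
                    return chunk[index];
                }
                index -= chunk.length;
            }
            throw new IllegalStateException("unreachable");
        }
    }

    /** Physical concatenation: one contiguous array, built by copying every input. */
    static int[] concat(List<int[]> batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        int[] out = new int[total];
        int offset = 0;
        for (int[] batch : batches) {
            System.arraycopy(batch, 0, out, offset, batch.length);
            offset += batch.length;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {4, 5};

        ChunkedIntArray chunked = new ChunkedIntArray();
        chunked.addChunk(first);
        chunked.addChunk(second);

        int[] contiguous = concat(List.of(first, second));

        System.out.println("chunked length = " + chunked.length());
        System.out.println("contiguous length = " + contiguous.length);
    }
}
```

The chunked view still shares memory with its inputs, so a later write to `first` is visible through it, whereas the result of `concat` is an independent copy.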


On Mon, Nov 11, 2019 at 9:59 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>> I think having a chunked array with multiple vector buffers would be
>> ideal, similar to C++. It might take a fair amount of work to add this but
>> would open up a lot more functionality.
>
>
> There are potentially two different use cases.  ChunkedArray is
> logical/lazy concatenation, whereas concat physically rebuilds the vectors
> into a single vector.
>
> On Fri, Nov 8, 2019 at 10:51 AM Bryan Cutler <cutl...@gmail.com> wrote:
>
>> I think having a chunked array with multiple vector buffers would be
>> ideal, similar to C++. It might take a fair amount of work to add this but
>> would open up a lot more functionality. As for the API,
>> VectorSchemaRoot.concat(Collection<VectorSchemaRoot>) seems good to me.
>>
>> On Thu, Nov 7, 2019 at 12:09 AM Fan Liya <liya.fa...@gmail.com> wrote:
>>
>>> Hi Micah,
>>>
>>> Thanks for bringing this up.
>>>
>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>> > implementations could possibly be improved upon or have they already been
>>> > optimized?
>>>
>>> Fundamentally, memory copy is unavoidable, IMO, because the source and
>>> target memory regions are likely to be non-contiguous.
>>> An alternative is to make ArrowBuf support a number of non-contiguous
>>> memory regions. However, that would harm the performance of ArrowBuf, and
>>> ArrowBuf is the core of the Arrow library.
>>>
>>> > 2.  What the preferred API for doing this would be?  Some options I can
>>> > think of:
>>>
>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>>
>>> IMO, option 1 is required, as we have scenarios that need to concatenate
>>> vectors/VectorSchemaRoots (e.g. restoring the complete dictionary from
>>> delta dictionaries).
>>> Options 2 and 3 are optional for us.
>>>
>>> Best,
>>> Liya Fan
>>>
>>> On Thu, Nov 7, 2019 at 3:44 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> > A colleague opened up https://issues.apache.org/jira/browse/ARROW-7048
>>> > for having similar functionality to the Python APIs that allow for
>>> > creating one larger data structure from a series of record batches.  I
>>> > just wanted to surface it here in case:
>>> > 1.  An efficient solution already exists? It seems like TransferPair
>>> > implementations could possibly be improved upon or have they already
>>> > been optimized?
>>> > 2.  What the preferred API for doing this would be?  Some options I can
>>> > think of:
>>> >
>>> > * VectorSchemaRoot.concat(Collection<VectorSchemaRoot>)
>>> > * VectorSchemaRoot.from(Collection<ArrowRecordBatch>)
>>> > * VectorLoader.load(Collection<ArrowRecordBatch>)
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>>
>>
