[
https://issues.apache.org/jira/browse/ARROW-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764582#comment-16764582
]
Eric Erhardt commented on ARROW-4502:
-------------------------------------
[~cshutchinson] - I took a look at this over the weekend, and there are a
couple of things I wanted to run by you.
# All of the "flat buffer" types are public - is this intentional? They feel
more like an "implementation detail" to me, and I think we should make them
internal. Thoughts?
# In order to use `ReadOnlyMemory<byte>` in the API, we will need to split the
`ByteBuffer` class into two: an editable version and a read-only version, just
like `ReadOnlySpan` vs. `Span` and `ReadOnlyMemory` vs. `Memory` are split out.
The reason is that I need to pass a `ReadOnlyMemory<byte>` into a `ByteBuffer`
in order to read "Messages". Note: I also needed to change `ByteBuffer` to be
backed by a `Memory<byte>` instead of a managed `byte[]`, because someone may
be passing in Arrow RecordBatch data in native memory (for interop scenarios
with other languages like C++). It shouldn't be necessary to copy that native
memory into a managed `byte[]` just to read the RecordBatch.
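To make the second point concrete, here is a minimal sketch of what the split could look like. The names `ReadOnlyByteBuffer` and `WritableByteBuffer` and their members are hypothetical, chosen only for illustration; they are not the actual FlatBuffers or Arrow types, and the real change may be shaped differently.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical read-only buffer: wraps ReadOnlyMemory<byte>, exposes only reads.
public readonly struct ReadOnlyByteBuffer
{
    private readonly ReadOnlyMemory<byte> _memory;

    public ReadOnlyByteBuffer(ReadOnlyMemory<byte> memory) => _memory = memory;

    public int Length => _memory.Length;

    // Reads a little-endian Int32 in place, without copying the underlying data.
    public int GetInt(int offset) =>
        BinaryPrimitives.ReadInt32LittleEndian(_memory.Span.Slice(offset, sizeof(int)));
}

// Hypothetical editable buffer: wraps Memory<byte> instead of a managed byte[].
public readonly struct WritableByteBuffer
{
    private readonly Memory<byte> _memory;

    public WritableByteBuffer(Memory<byte> memory) => _memory = memory;

    public void PutInt(int offset, int value) =>
        BinaryPrimitives.WriteInt32LittleEndian(_memory.Span.Slice(offset, sizeof(int)), value);

    // Memory<byte> converts implicitly to ReadOnlyMemory<byte>, so this view copies nothing.
    public ReadOnlyByteBuffer AsReadOnly() => new ReadOnlyByteBuffer(_memory);
}

public static class Demo
{
    // Writes a value through the editable buffer and reads it back through the read-only view.
    public static int RoundTrip()
    {
        var bytes = new byte[8];
        var writer = new WritableByteBuffer(bytes); // implicit byte[] -> Memory<byte>
        writer.PutInt(4, 42);
        return writer.AsReadOnly().GetInt(4);
    }
}
```

Because both wrappers sit on top of `Memory<byte>` / `ReadOnlyMemory<byte>`, the same code path would also work for native memory surfaced through a custom `System.Buffers.MemoryManager<T>`, which is the interop case mentioned above.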
I have some preliminary perf results to share from my investigation. Reading
~1 million records with 7 numeric columns and adding up all the numbers is a lot
faster without the allocations and copies. Here are some benchmark results of
my prototype vs. the current ArrowStreamReader, both reading from an in-memory
buffer (MemoryStream):
                 Method |       Mean | Gen 0/1k Op | Allocated Memory/Op |
----------------------- |-----------:|------------:|--------------------:|
      ArrowStreamReader | 110.018 ms |  21000.0000 |        110693.78 KB |
 ArrowRecordBatchReader |   6.789 ms |           - |            63.52 KB |
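For intuition about where the allocation difference comes from, the following is an illustrative sketch only. It is not the benchmark code and not the prototype, and `ColumnSum` and its methods are hypothetical names. It contrasts summing an Int32 column after copying it into a new managed array with summing it in place over the existing bytes.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ColumnSum
{
    // Copying approach: materializes a new managed array before reading,
    // so every call allocates data.Length bytes.
    public static long SumWithCopy(ReadOnlyMemory<byte> data)
    {
        byte[] copy = data.ToArray();
        long sum = 0;
        for (int i = 0; i + sizeof(int) <= copy.Length; i += sizeof(int))
        {
            sum += BitConverter.ToInt32(copy, i);
        }
        return sum;
    }

    // Zero-copy approach: reinterprets the existing bytes as Int32 values in place.
    // Both methods use the machine's native byte order.
    public static long SumZeroCopy(ReadOnlyMemory<byte> data)
    {
        ReadOnlySpan<int> values = MemoryMarshal.Cast<byte, int>(data.Span);
        long sum = 0;
        foreach (int v in values)
        {
            sum += v;
        }
        return sum;
    }
}
```

Both methods return the same sum for the same input; only the first allocates a buffer proportional to the size of the data.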
> [C#] Add support for zero-copy reads
> ------------------------------------
>
> Key: ARROW-4502
> URL: https://issues.apache.org/jira/browse/ARROW-4502
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C#
> Reporter: Eric Erhardt
> Assignee: Eric Erhardt
> Priority: Major
> Labels: performance
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> In the Python (and C++) API, you can create a `RecordBatchStreamReader`, and
> if you give it an `InputStream` that supports zero-copy reads, you can get
> back `RecordBatch` objects without allocating new memory and copying all the
> data.
> In C#, there is currently no way to read Arrow `RecordBatch` instances without
> allocating new memory and copying all the data. We should enable this
> scenario in the C# API.
>
> My proposal is to create a new `class ArrowRecordBatchReader : IArrowReader`.
> Its constructor will take a `ReadOnlyMemory<byte> data` parameter, and it
> will be able to read `RecordBatch` instances just like the existing
> `ArrowStreamReader`. As part of this new class, we will refactor any common
> code out of `ArrowStreamReader` in order for the parsing logic to be shared,
> where necessary.