[ 
https://issues.apache.org/jira/browse/ARROW-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764582#comment-16764582
 ] 

Eric Erhardt commented on ARROW-4502:
-------------------------------------

[~cshutchinson] - I took a look at this over the weekend, and there are a 
couple of things I wanted to run by you.
 # All of the "flat buffer" types are public - is this intentional? They feel 
more like an "implementation detail" to me, and I think we should make them 
internal. Thoughts?
 # In order to use `ReadOnlyMemory<byte>` in the API, we will need to split the 
`ByteBuffer` class into two: an editable version and a read-only version, the 
same way `Span` vs. `ReadOnlySpan` and `Memory` vs. `ReadOnlyMemory` are split. 
The reason is that I need to pass a `ReadOnlyMemory<byte>` into a `ByteBuffer` 
in order to read "Messages". Note: I also needed to change `ByteBuffer` to be 
backed by a `Memory` instead of a managed `byte[]`, because someone may pass in 
Arrow RecordBatch data that lives in native memory (for interop scenarios with 
other languages like C++). It shouldn't be necessary to copy that native memory 
into a managed `byte[]` just to read the RecordBatch.
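
A minimal sketch of the split described in the second point, assuming nothing about the real FlatBuffers `ByteBuffer` beyond what is said above. The type names `ReadOnlyByteBufferSketch` and `WritableByteBufferSketch`, and their `GetInt`/`PutInt`/`AsReadOnly` members, are hypothetical and only illustrate the `Memory` vs. `ReadOnlyMemory` pattern:

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical read-only buffer: wraps ReadOnlyMemory<byte>, so it can sit on
// top of a managed array or any other memory exposed as ReadOnlyMemory<byte>.
public readonly struct ReadOnlyByteBufferSketch
{
    private readonly ReadOnlyMemory<byte> _data;

    public ReadOnlyByteBufferSketch(ReadOnlyMemory<byte> data) => _data = data;

    public int Length => _data.Length;

    // Reads a little-endian Int32 at the given offset without copying.
    public int GetInt(int offset) =>
        BinaryPrimitives.ReadInt32LittleEndian(_data.Span.Slice(offset, sizeof(int)));
}

// Hypothetical editable buffer: wraps Memory<byte> and adds the write methods.
public readonly struct WritableByteBufferSketch
{
    private readonly Memory<byte> _data;

    public WritableByteBufferSketch(Memory<byte> data) => _data = data;

    public void PutInt(int offset, int value) =>
        BinaryPrimitives.WriteInt32LittleEndian(_data.Span.Slice(offset, sizeof(int)), value);

    // Memory<byte> converts implicitly to ReadOnlyMemory<byte>, so no copy is made.
    public ReadOnlyByteBufferSketch AsReadOnly() => new ReadOnlyByteBufferSketch(_data);
}

public static class Program
{
    public static void Main()
    {
        var backing = new byte[8];
        var writable = new WritableByteBufferSketch(backing);
        writable.PutInt(4, 42);

        ReadOnlyByteBufferSketch readOnly = writable.AsReadOnly();
        Console.WriteLine(readOnly.GetInt(4));
    }
}
```

The read-only type has no write methods at all, so a caller holding only a `ReadOnlyMemory<byte>` can still construct it, which is the case the current single `ByteBuffer` class cannot serve.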

I have some preliminary perf results to share from my investigation. Reading 
~1 million records with 7 numeric columns and adding up all the numbers is a 
lot faster without the allocations and copies. Here are some benchmark results 
of my prototype vs. the current ArrowStreamReader, both reading from an 
in-memory buffer (MemoryStream):

| Method                 |       Mean | Gen 0/1k Op | Allocated Memory/Op |
|------------------------|-----------:|------------:|--------------------:|
| ArrowStreamReader      | 110.018 ms |  21000.0000 |        110693.78 KB |
| ArrowRecordBatchReader |   6.789 ms |           - |            63.52 KB |
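
As a rough illustration of where a difference in allocations like this can come from, here is a sketch that reads one length-prefixed message body two ways. The framing (a 4-byte little-endian length followed by the body) is a simplification chosen for the example, not the actual Arrow IPC format, and `FramingSketch` with its methods is a hypothetical name, not part of the prototype:

```csharp
using System;
using System.Buffers.Binary;
using System.IO;

public static class FramingSketch
{
    // Copying path: every message body is read into a newly allocated array.
    public static byte[] ReadBodyCopy(Stream stream)
    {
        var prefix = new byte[sizeof(int)];
        ReadExactly(stream, prefix);
        int length = BinaryPrimitives.ReadInt32LittleEndian(prefix);

        var body = new byte[length];
        ReadExactly(stream, body);
        return body;
    }

    // Zero-copy path: the body is a slice over memory the caller already owns.
    public static ReadOnlyMemory<byte> ReadBodySlice(ReadOnlyMemory<byte> data)
    {
        int length = BinaryPrimitives.ReadInt32LittleEndian(data.Span);
        return data.Slice(sizeof(int), length);
    }

    // Stream.Read may return fewer bytes than requested, so loop until full.
    private static void ReadExactly(Stream stream, byte[] buffer)
    {
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Length prefix of 3, followed by a 3-byte body.
        byte[] framed = { 3, 0, 0, 0, 10, 20, 30 };

        byte[] copy = FramingSketch.ReadBodyCopy(new MemoryStream(framed));
        ReadOnlyMemory<byte> slice = FramingSketch.ReadBodySlice(framed);

        Console.WriteLine(copy.Length);
        Console.WriteLine(slice.Length);
    }
}
```

In the copying path the allocated bytes grow with the size of the data read, while the slicing path only creates a small `ReadOnlyMemory<byte>` value that points into the original buffer.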

> [C#] Add support for zero-copy reads
> ------------------------------------
>
>                 Key: ARROW-4502
>                 URL: https://issues.apache.org/jira/browse/ARROW-4502
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C#
>            Reporter: Eric Erhardt
>            Assignee: Eric Erhardt
>            Priority: Major
>              Labels: performance
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the Python (and C++) API, you can create a `RecordBatchStreamReader`, and 
> if you give it an `InputStream` that supports zero-copy reads, you can get 
> back `RecordBatch` objects without allocating new memory and copying all the 
> data.
> In C#, there is currently no way to read Arrow RecordBatch instances without 
> allocating new memory and copying all the data. We should enable this 
> scenario in the C# API.
>  
> My proposal is to create a new `class ArrowRecordBatchReader : IArrowReader`. 
> Its constructor will take a `ReadOnlyMemory<byte> data` parameter, and it 
> will be able to read `RecordBatch` instances just like the existing 
> `ArrowStreamReader`. As part of this new class, we will refactor any common 
> code out of `ArrowStreamReader` in order for the parsing logic to be shared, 
> where necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
