[
https://issues.apache.org/jira/browse/ARROW-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764593#comment-16764593
]
Chris Hutchinson commented on ARROW-4502:
-----------------------------------------
[~eerhardt]
1. The intent was to auto-generate the FlatBuffer code at build time using the
Google FlatBuffer schema compiler tooling
([https://google.github.io/flatbuffers/flatbuffers_guide_using_schema_compiler.html]).
This was never integrated due to initial time constraints and cross-platform
concerns. We opted to just include the output of the schema compiler in the
interest of producing a working prototype. I agree that all of these types
should be marked internal. Just be aware that all of the FlatBuffer code was
generated by the schema compiler, which is why the types are marked public;
not much time was invested in adjusting the compiler to emit the types as
internal.
2. ByteBuffer was auto-generated by the Google FlatBuffer schema compiler (as
mentioned above). It seemed advantageous to use the compiler because it would
be trivial to keep parity with the Arrow specification. The best course of
action here may be to investigate whether the FlatBuffer schema compiler can
be patched or hooked to produce zero-copy variants of ByteBuffer that use
Memory/Span. Another option may be to forgo the FlatBuffer compiler entirely
and write a custom serialization implementation that otherwise follows the
FlatBuffer spec. Thoughts? You definitely have the right idea; the concern is
that it means modifying generated code, so we should be careful with that.
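To make the first option more concrete, here is a minimal sketch of what a
Memory/Span-backed buffer accessor could look like. The type and member names
(MemoryByteBuffer, GetInt, Slice) are hypothetical and for illustration only,
not the actual generated FlatBuffers API:

    using System;
    using System.Buffers.Binary;

    // Hypothetical zero-copy alternative to the generated ByteBuffer: reads
    // are served directly from the backing memory, and regions are handed out
    // as slices rather than copies.
    internal sealed class MemoryByteBuffer
    {
        private readonly ReadOnlyMemory<byte> _memory;

        public MemoryByteBuffer(ReadOnlyMemory<byte> memory) => _memory = memory;

        public int Length => _memory.Length;

        // Read a little-endian Int32 straight out of the backing memory.
        public int GetInt(int offset) =>
            BinaryPrimitives.ReadInt32LittleEndian(_memory.Span.Slice(offset, sizeof(int)));

        // Expose a region of the backing memory without copying it, e.g. to
        // wrap a record batch body buffer.
        public ReadOnlyMemory<byte> Slice(int offset, int length) =>
            _memory.Slice(offset, length);
    }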
3. It would be great if you could include those benchmarks in the solution to
help compare with future improvements. If you have the time for this, consider
the folder structure used by the Entity Framework Core project:
[https://github.com/aspnet/EntityFrameworkCore]
4. Consider taking a look at recent changes to ArrowBufferBuilder<T>. It was
recently changed to use managed memory until the buffer is "built". The
intention was to a) make it possible to build multiple copies of an array
backed by separate memory, and b) allow the buffer builder to be reused. I'm
not sure that was a good idea. The approach Rust takes is to build the buffer
up using the typical Arrow memory allocation strategies, then "freeze" the
buffer memory into a new Arrow buffer (see the sketch below). This implies a
single builder per buffer, which isn't necessarily a bad thing: it means
(essentially) one allocation per buffer instead of (possibly) two, and it may
be better for API usability and in the spirit of minimizing allocations.
Thoughts here?
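Here is the sketch mentioned above: a minimal illustration of the "build then
freeze" flow, assuming a hypothetical FrozenBufferBuilder type (not the
existing ArrowBufferBuilder<T>):

    using System;
    using System.Buffers.Binary;

    // Hypothetical builder that owns mutable memory while values are appended,
    // then hands that memory off exactly once: one builder per buffer, and
    // (essentially) one allocation per buffer.
    internal sealed class FrozenBufferBuilder
    {
        private byte[] _buffer = new byte[64];
        private int _length;
        private bool _frozen;

        public void AppendInt32(int value)
        {
            if (_frozen) throw new InvalidOperationException("Builder has been frozen.");
            if (_length + sizeof(int) > _buffer.Length)
                Array.Resize(ref _buffer, _buffer.Length * 2);
            BinaryPrimitives.WriteInt32LittleEndian(_buffer.AsSpan(_length), value);
            _length += sizeof(int);
        }

        // "Freeze" the accumulated bytes into a read-only view; the builder is
        // not reusable afterwards.
        public ReadOnlyMemory<byte> Freeze()
        {
            _frozen = true;
            return new ReadOnlyMemory<byte>(_buffer, 0, _length);
        }
    }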
Thanks for taking a look. I'm sure there are many areas for improvement. Let me
know if you have any more questions. I look forward to collaborating with you!
> [C#] Add support for zero-copy reads
> ------------------------------------
>
> Key: ARROW-4502
> URL: https://issues.apache.org/jira/browse/ARROW-4502
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C#
> Reporter: Eric Erhardt
> Assignee: Eric Erhardt
> Priority: Major
> Labels: performance
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> In the Python (and C++) API, you can create a `RecordBatchStreamReader`, and
> if you give it an `InputStream` that supports zero-copy reads, you can get
> back `RecordBatch` objects without allocating new memory and copying all the
> data.
> In the C# library, there is currently no way to read Arrow `RecordBatch`
> instances without allocating new memory and copying all the data. We should
> enable this scenario in the C# API.
>
> My proposal is to create a new `class ArrowRecordBatchReader : IArrowReader`.
> Its constructor will take a `ReadOnlyMemory<byte> data` parameter, and it
> will be able to read `RecordBatch` instances just like the existing
> `ArrowStreamReader`. As part of this new class, we will refactor any common
> code out of `ArrowStreamReader` so that the parsing logic can be shared
> where necessary.
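
For reference, here is a minimal sketch of the proposed reader shape. The
IArrowReader member shown (ReadNextRecordBatchAsync) is assumed from the
existing reader surface, and the actual message parsing is elided:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Apache.Arrow;
    using Apache.Arrow.Ipc;

    // Sketch only: a reader over an in-memory Arrow stream that would return
    // RecordBatch instances whose buffers are slices of the input memory.
    public class ArrowRecordBatchReader : IArrowReader
    {
        private readonly ReadOnlyMemory<byte> _data;

        public ArrowRecordBatchReader(ReadOnlyMemory<byte> data)
        {
            _data = data;
        }

        public ValueTask<RecordBatch> ReadNextRecordBatchAsync(
            CancellationToken cancellationToken = default)
        {
            // Parse the next message from _data, wrap its body buffers as
            // slices of _data (no copy), and advance a read offset.
            throw new NotImplementedException();
        }
    }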
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)