[
https://issues.apache.org/jira/browse/ARROW-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764593#comment-16764593
]
Chris Hutchinson commented on ARROW-4502:
-----------------------------------------
[~eerhardt]
1. The intent was to auto-generate the FlatBuffer code at build time using the
Google FlatBuffer schema compiler tooling
([https://google.github.io/flatbuffers/flatbuffers_guide_using_schema_compiler.html]).
This was never integrated due to initial time constraints and cross-platform
concerns. We opted to just include the output of the schema compiler in the
interest of producing a working prototype. I agree that all of these types
should be marked internal. Just be aware that all of the FlatBuffer code was
generated by the schema compiler, which is why the types are marked public;
not much time was invested in adjusting the compiler to emit the types as
internal.
2. ByteBuffer was auto-generated by the Google FlatBuffer schema compiler (as
mentioned above). It seemed advantageous to use the compiler because it would
be trivial to keep parity with the Arrow specification. The best course of
action here may be to investigate whether the FlatBuffer schema compiler can
be patched or hooked to produce zero-copy variants of ByteBuffer that use
Memory/Span. Another option may be to forgo the FlatBuffer compiler entirely
and write a custom serialization implementation that otherwise follows the
FlatBuffer spec. Thoughts? You definitely have the right idea; the concern is
that it means modifying generated code, so we should be careful with that.
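To make the first option more concrete, here is a minimal sketch of what a
Memory/Span-backed buffer accessor could look like. The type and member names
(MemoryByteBuffer, GetInt, Slice) are hypothetical and for illustration only,
not the actual generated FlatBuffers API:

    using System;
    using System.Buffers.Binary;

    // Hypothetical zero-copy alternative to the generated ByteBuffer: reads
    // are served directly from the backing memory, and regions are handed out
    // as slices rather than copies.
    internal sealed class MemoryByteBuffer
    {
        private readonly ReadOnlyMemory<byte> _memory;

        public MemoryByteBuffer(ReadOnlyMemory<byte> memory) => _memory = memory;

        public int Length => _memory.Length;

        // Read a little-endian Int32 straight out of the backing memory.
        public int GetInt(int offset) =>
            BinaryPrimitives.ReadInt32LittleEndian(_memory.Span.Slice(offset, sizeof(int)));

        // Expose a region of the backing memory without copying it, e.g. to
        // wrap a record batch body buffer.
        public ReadOnlyMemory<byte> Slice(int offset, int length) =>
            _memory.Slice(offset, length);
    }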
3. It would be great if you could include those benchmarks in the solution to
help compare with future improvements. If you have the time for this, consider
the folder structure used by the Entity Framework Core project:
[https://github.com/aspnet/EntityFrameworkCore]
4. Consider taking a look at recent changes to ArrowBufferBuilder<T>. It was
recently changed to use managed memory until the buffer is "built". The
intention was to a) make it possible to build multiple copies of an array
backed by separate memory, and b) allow the buffer builder to be reused. I'm
not sure that was a good idea. The approach Rust takes is to build the buffer
up using the typical Arrow memory allocation strategies, then "freeze" the
buffer memory into a new Arrow buffer (see the sketch below). This implies a
single builder per buffer, which isn't necessarily a bad thing: it means
(essentially) one allocation per buffer instead of (possibly) two, and it may
be better for API usability and in the spirit of minimizing allocations.
Thoughts here?
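Here is the sketch mentioned above: a minimal illustration of the "build then
freeze" flow, assuming a hypothetical FrozenBufferBuilder type (not the
existing ArrowBufferBuilder<T>):

    using System;
    using System.Buffers.Binary;

    // Hypothetical builder that owns mutable memory while values are appended,
    // then hands that memory off exactly once: one builder per buffer, and
    // (essentially) one allocation per buffer.
    internal sealed class FrozenBufferBuilder
    {
        private byte[] _buffer = new byte[64];
        private int _length;
        private bool _frozen;

        public void AppendInt32(int value)
        {
            if (_frozen) throw new InvalidOperationException("Builder has been frozen.");
            if (_length + sizeof(int) > _buffer.Length)
                Array.Resize(ref _buffer, _buffer.Length * 2);
            BinaryPrimitives.WriteInt32LittleEndian(_buffer.AsSpan(_length), value);
            _length += sizeof(int);
        }

        // "Freeze" the accumulated bytes into a read-only view; the builder is
        // not reusable afterwards.
        public ReadOnlyMemory<byte> Freeze()
        {
            _frozen = true;
            return new ReadOnlyMemory<byte>(_buffer, 0, _length);
        }
    }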
Thanks for taking a look. I'm sure there are many areas for improvement. Let me
know if you have any more questions. I look forward to collaborating with you!
> [C#] Add support for zero-copy reads
> ------------------------------------
>
> Key: ARROW-4502
> URL: https://issues.apache.org/jira/browse/ARROW-4502
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C#
> Reporter: Eric Erhardt
> Assignee: Eric Erhardt
> Priority: Major
> Labels: performance
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> In the Python (and C++) API, you can create a `RecordBatchStreamReader`, and
> if you give it an `InputStream` that supports zero-copy reads, you can get
> back `RecordBatch` objects without allocating new memory and copying all the
> data.
> In the C# library, there is currently no way to read Arrow `RecordBatch`
> instances without allocating new memory and copying all the data. We should
> enable this scenario in the C# API.
>
> My proposal is to create a new `class ArrowRecordBatchReader : IArrowReader`.
> Its constructor will take a `ReadOnlyMemory<byte> data` parameter, and it
> will be able to read `RecordBatch` instances just like the existing
> `ArrowStreamReader`. As part of this new class, we will refactor any common
> code out of `ArrowStreamReader` so that the parsing logic can be shared
> where necessary.
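
For reference, here is a minimal sketch of the proposed reader shape. The
IArrowReader member shown (ReadNextRecordBatchAsync) is assumed from the
existing reader surface, and the actual message parsing is elided:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Apache.Arrow;
    using Apache.Arrow.Ipc;

    // Sketch only: a reader over an in-memory Arrow stream that would return
    // RecordBatch instances whose buffers are slices of the input memory.
    public class ArrowRecordBatchReader : IArrowReader
    {
        private readonly ReadOnlyMemory<byte> _data;

        public ArrowRecordBatchReader(ReadOnlyMemory<byte> data)
        {
            _data = data;
        }

        public ValueTask<RecordBatch> ReadNextRecordBatchAsync(
            CancellationToken cancellationToken = default)
        {
            // Parse the next message from _data, wrap its body buffers as
            // slices of _data (no copy), and advance a read offset.
            throw new NotImplementedException();
        }
    }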
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)