[
https://issues.apache.org/jira/browse/ARROW-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525893#comment-17525893
]
Micah Kornfield commented on ARROW-16118:
-----------------------------------------
Also, we should be careful how this is enabled, since if someone is actually
consuming the stream in real time there would need to be some sort of
coordination to ensure bytes aren't sent prematurely.
> [C++] Reduce memory usage when writing to IPC
> ---------------------------------------------
>
> Key: ARROW-16118
> URL: https://issues.apache.org/jira/browse/ARROW-16118
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jorge Leitão
> Priority: Major
>
> Writing a record batch to IPC ([header][buffers]) currently requires O(N*B)
> memory, where N is the average size of a buffer and B is the number of
> buffers in the record batch.
> This is because we need the buffer locations and the total number of bytes
> to write the header of the record batch, and these are only known after,
> e.g., knowing by how much the buffers were compressed.
> When the writer supports seeking, this memory usage can be reduced to O(N)
> where N is the average size of a primitive buffer over all fields. This is
> done using the following pseudo-code implementation:
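> {code:java}
> start = writer.seek(current)              // remember where the header begins
> empty_locations = create_empty_header(schema)
> write_header(writer, empty_locations)     // reserve space with a placeholder header
> locations = write_buffers(writer, batch)  // write buffers one at a time, recording locations
> end = writer.seek(current)                // remember where the buffers end
> writer.seek(start)
> write_header(writer, locations)           // overwrite the placeholder with the real locations
> writer.seek(end)                          // resume writing after the last buffer
> {code}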
> This has a significantly lower memory footprint: O(N) vs. O(N*B).
> It could be interesting for the C++ implementation to support this.
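
For illustration, the seek-back pattern from the issue description can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not the Arrow IPC format or the C++ writer: the fixed-size header layout (one `(offset, length)` pair of little-endian uint64 values per buffer), the use of `zlib` as the compressor, and the helper names `write_header`, `write_batch`, and `read_batch` are all hypothetical. The trick relies on the placeholder header and the final header having exactly the same size.

```python
import io
import struct
import zlib

ENTRY = struct.Struct("<QQ")  # hypothetical header entry: (offset, length)


def write_header(sink, locations):
    """Write one fixed-size (offset, length) entry per buffer."""
    for offset, length in locations:
        sink.write(ENTRY.pack(offset, length))


def write_batch(sink, buffers):
    """Write [header][buffers], holding at most one compressed buffer in memory."""
    if not sink.seekable():
        raise ValueError("this strategy requires a seekable sink")
    start = sink.tell()
    # Reserve space with a placeholder header of the final size.
    write_header(sink, [(0, 0)] * len(buffers))
    locations = []
    for buf in buffers:
        compressed = zlib.compress(buf)  # size is only known after compressing
        locations.append((sink.tell() - start, len(compressed)))
        sink.write(compressed)
    end = sink.tell()
    # Go back and overwrite the placeholder with the real locations.
    sink.seek(start)
    write_header(sink, locations)
    # Restore the position so later writes append after the last buffer.
    sink.seek(end)
    return locations


def read_batch(source, num_buffers):
    """Read back a batch written by write_batch, starting at the current position."""
    start = source.tell()
    raw = source.read(ENTRY.size * num_buffers)
    locations = [ENTRY.unpack_from(raw, ENTRY.size * i) for i in range(num_buffers)]
    result = []
    for offset, length in locations:
        source.seek(start + offset)
        result.append(zlib.decompress(source.read(length)))
    return result
```

Until the final `write_header` call completes, the header region holds placeholder zeros, which is why a consumer reading the stream in real time would need some coordination before treating those bytes as valid.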
--
This message was sent by Atlassian Jira
(v8.20.7#820007)