[ 
https://issues.apache.org/jira/browse/ARROW-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517637#comment-17517637
 ] 

Weston Pace commented on ARROW-16118:
-------------------------------------

One note is that this will only be possible when using a local filesystem.  
Object stores usually do not support modification of files after they have been 
uploaded.  So we might need to add a flag to a filesystem to ask whether it 
supports this but that can be useful.  On spinning disk filesystems (i.e. HDD) 
the seek penalty may cause this to be less (or potentially negative) of a 
benefit if there is not much RAM pressure.

However, I could see this being useful in cases where we are writing to a very 
fast solid state drive.

Another memory-usage issue that might be related since we are talking about 
writes is ARROW-14635.  I'd like to move to direct I/O (or at least being 
smarter about OS page cache) when writing large datasets since we otherwise end 
up causing the system to swap for no good reason.

> [C++] Reduce memory usage when writing to IPC
> ---------------------------------------------
>
>                 Key: ARROW-16118
>                 URL: https://issues.apache.org/jira/browse/ARROW-16118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jorge Leitão
>            Priority: Major
>
> Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) 
> where N is the average size of the buffer and B the number of buffers in the 
> recordbatch.
> This is because we need the buffer location and total number of bytes to 
> write the header of the record, which is only known after e.g. knowning by 
> how much the buffers were compressed.
> When the writer supports seeking, this memory usage can be reduced to O(N) 
> where N is the average size of a primitive buffer over all fields. This is 
> done using the following pseudo-code implementation:
> {code:java}
> start = writer.seek(current);
> empty_locations = create_empty_header(schema)
> write_header(writer, empty_locations)
> locations = write_buffers(writer, batch)
> writer.seek(start)
> write_header(writer, locations)
> {code}
> This has a significantly lower memory footprint. O(N) vs O(N*B)
> It could be interesting for the C++ implementation to support this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to