[
https://issues.apache.org/jira/browse/ARROW-18115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623420#comment-17623420
]
Antoine Pitrou commented on ARROW-18115:
----------------------------------------
IIRC there isn't much contention on the PR (at least the fundamentals). It's
just that review bandwidth is scarce, and the release was higher priority.
> [C++] Acero buffer alignment
> ----------------------------
>
> Key: ARROW-18115
> URL: https://issues.apache.org/jira/browse/ARROW-18115
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> This is a general JIRA to centralize some discussion / proposal on a buffer
> alignment strategy for Acero based on discussions that have happened in a few
> different contexts. Any actual work will probably span multiple JIRAs, some
> of which are already in progress.
> Motivations:
> * Incoming data may not be aligned at all. Some kernel functions and parts
> of Acero (e.g. aggregation, join) use explicit SIMD instructions (e.g.
> intrinsics) and will fail / cause corruption if care is not taken to use
> unaligned loads (e.g. ARROW-17783). There are parts of the code that assume
> fixed arrays with size T are at least T aligned. This is generally a safe
> assumption for data generated by arrow-c++ (which allocates all buffers with
> 64 byte alignment) but then leads to errors when processing data from flight.
> * Dataset writes and mid-plan spilling both can benefit form direct I/O, less
> for performance reasons and more for memory management reasons. However, in
> order to use direct I/O a buffer needs to be aligned, typically to 512 bytes.
> This is larger than the current minimum alignment requirements.
> Proposal:
> * Allow the minimum alignment of a memory pool to be configurable. This is
> similar to the proposal of ARROW-17836 but does not require much of an API
> change (at the expense of being slightly less flexible).
> * Add a capability to realign a buffer/array/batch/table to a target
> alignment. This would only modify buffers that are not already aligned.
> Basically, given an arrow object, and a target memory pool, migrate a buffer
> to the target memory pool if its alignment is insufficient.
> * Acero, in the source node, forces all buffers to be 64 byte aligned. This
> way the internals of Acero don't have to worry about this case. This
> introduces a performance penalty when buffers are not aligned but is much
> simpler to maintain and test than trying to support any random buffer. To
> avoid this penalty it would be simpler to avoid the unaligned buffers in the
> first place.
> * Acero requires a memory pool that has 512-byte alignment so that
> Acero-allocated buffers can use direct I/O. If the default memory pool does
> not have 512-byte alignment then Acero can use a per-plan pool. This covers
> the common case for spilling and dataset writing which is that we are
> partitioning prior to the write and so we are writing Acero-allocated buffers
> anyways.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)