[
https://issues.apache.org/jira/browse/ARROW-18141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoine Pitrou updated ARROW-18141:
-----------------------------------
Fix Version/s: 11.0.0
> Alignment not enforced; undefined behavior
> ------------------------------------------
>
> Key: ARROW-18141
> URL: https://issues.apache.org/jira/browse/ARROW-18141
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Jochen Ott
> Priority: Major
> Fix For: 11.0.0
>
> Attachments: test1.py
>
>
> It is possible to create arrays using unaligned memory addresses (e.g. for
> int64). This seems to be in line with the arrow specification which as far as
> I understand does not require alignment [1].
> However, the C++ standard requires alignment, e.g. 8 byte alignment for
> int64. It is undefined behavior (UB) to create an unaligned pointer /
> accessing data via an unaligned pointer.
> Typically, this is not an issue in practice on x86, since gcc and other
> compilers mostly emit instructions that can deal with unaligned data.
> However, for gcc 6.3.0 (and probably up to including gcc versions 7.X), this
> code:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/statistics.cc#L355]
> creates an aligned move instruction (movdqa) for the expression
> {{{}values[i]{}}}. This, in turn, triggers a SIGSEGV in case {{values}} is
> called via an unaligned buffer. Later compiler versions (in particular gcc
> 9.X used to build the wheels published on pypi) will emit instructions that
> can deal with unaligned data (movdqu instead of movdqa).
> The python script "test1.py" reproduces this issue on python-level; note that
> it will only trigger a SIGSEGV if compiling arrow with a compiler that emits
> movdqa for the code linked above, e.g. by using gcc 6.3.0 to compile arrow.
> In the wild, unaligned buffers are rare, but can appear, e.g. as a result of
> deserializing pandas dataframes / numpy arrays using pickle protocol 5 that
> allows out-of-band byte buffers that are re-used as arrow array buffers.
> I think the line to first enter the UB regime is this reinterpret_cast:
> https://github.com/apache/arrow/blob/33f2c0ec8e281fc4fe8c03b07ed2d32e343d9b0e/cpp/src/parquet/column_writer.cc#L1592
> [1][https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding]
> merely "recommends" that buffers are aligned, but does not require it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)