Jochen Ott created ARROW-18141:
----------------------------------

             Summary: Alignment not enforced; undefined behavior
                 Key: ARROW-18141
                 URL: https://issues.apache.org/jira/browse/ARROW-18141
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Jochen Ott
         Attachments: test1.py

It is possible to create arrays using unaligned memory addresses (e.g. for 
int64). This seems to be in line with the Arrow specification, which, as far as 
I understand, does not require alignment [1].

However, the C++ standard requires alignment, e.g. 8-byte alignment for int64. 
It is undefined behavior (UB) to create an unaligned pointer or to access data 
via an unaligned pointer.
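
To make the alignment requirement concrete, here is a minimal C++ sketch of how an address can be checked against a type's alignment and how a value can be read from a possibly unaligned address without UB. The names IsAligned and LoadUnaligned are hypothetical helpers for illustration, not Arrow functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of Arrow): true if `p` is suitably aligned
// for an object of type T. The pointer-to-integer mapping is
// implementation-defined, but is the plain address on common platforms.
template <typename T>
bool IsAligned(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Hypothetical helper (not part of Arrow): reads a T from a possibly
// unaligned address. std::memcpy has no alignment requirement, so this is
// well defined where dereferencing a reinterpret_cast'ed pointer would not be.
template <typename T>
T LoadUnaligned(const void* p) {
  assert(p != nullptr);
  T value;
  std::memcpy(&value, p, sizeof(T));
  return value;
}
```

With these, IsAligned<std::int64_t>(ptr) reports whether a direct typed access would be safe, and LoadUnaligned<std::int64_t>(ptr) works either way.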

Typically, this is not an issue in practice on x86, since gcc and other 
compilers mostly emit instructions that can deal with unaligned data. However, 
for gcc 6.3.0 (and probably up to and including gcc versions 7.X), this code:

[https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/statistics.cc#L355]

creates an aligned move instruction (movdqa) for the expression 
{{values[i]}}. This, in turn, triggers a SIGSEGV if {{values}} points into an 
unaligned buffer. Later compiler versions (in particular gcc 9.X, which is 
used to build the wheels published on PyPI) emit instructions that can 
deal with unaligned data (movdqu instead of movdqa).
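
For illustration, a loop over a possibly unaligned int64 buffer can avoid the typed {{values[i]}} access altogether by loading each element with std::memcpy. The sketch below is a hypothetical min/max helper, not the code in statistics.cc; MinMaxUnaligned and its signature are assumptions. Optimizing compilers typically lower a fixed-size memcpy like this to a plain unaligned load, but that is a property of the compiler, not something the standard guarantees.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical helper (not part of Arrow): returns (min, max) of `count`
// int64 values stored contiguously at `data`, which may be unaligned.
std::pair<std::int64_t, std::int64_t> MinMaxUnaligned(const unsigned char* data,
                                                      std::size_t count) {
  assert(data != nullptr && count > 0);
  std::int64_t first;
  std::memcpy(&first, data, sizeof(first));  // no alignment requirement
  std::int64_t lo = first;
  std::int64_t hi = first;
  for (std::size_t i = 1; i < count; ++i) {
    std::int64_t v;
    // Byte-wise copy of element i instead of a typed load through int64_t*.
    std::memcpy(&v, data + i * sizeof(v), sizeof(v));
    lo = std::min(lo, v);
    hi = std::max(hi, v);
  }
  return {lo, hi};
}
```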

The Python script "test1.py" reproduces this issue at the Python level; note 
that it will only trigger a SIGSEGV if Arrow is compiled with a compiler that 
emits movdqa for the code linked above, e.g. gcc 6.3.0.

In the wild, unaligned buffers are rare, but they can appear, e.g. as a result 
of deserializing pandas dataframes / numpy arrays using pickle protocol 5, 
which allows out-of-band byte buffers that are then re-used as Arrow array 
buffers.

I think the first line to enter the UB regime is this reinterpret_cast:


https://github.com/apache/arrow/blob/33f2c0ec8e281fc4fe8c03b07ed2d32e343d9b0e/cpp/src/parquet/column_writer.cc#L1592
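
One possible way to guard such a cast, sketched here under assumptions and not taken from Arrow, is to check the address first and fall back to a copy into aligned storage when the check fails. AlignedView and the caller-provided scratch vector are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not Arrow API): returns a pointer to `count` int64
// values that is safe to dereference. If `data` is already aligned for
// int64_t it is returned as-is; otherwise the bytes are copied into
// `scratch`, whose storage is always suitably aligned.
const std::int64_t* AlignedView(const unsigned char* data, std::size_t count,
                                std::vector<std::int64_t>* scratch) {
  assert(data != nullptr && scratch != nullptr);
  if (reinterpret_cast<std::uintptr_t>(data) % alignof(std::int64_t) == 0) {
    return reinterpret_cast<const std::int64_t*>(data);
  }
  scratch->resize(count);
  if (count > 0) {
    std::memcpy(scratch->data(), data, count * sizeof(std::int64_t));
  }
  return scratch->data();
}
```

The copy costs extra memory and time, but it is only paid for the rare unaligned buffer; aligned input takes the zero-copy path.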

[1] [https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding]
 merely "recommends" that buffers be aligned, but does not require it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
