TP Boudreau created ARROW-5618:
----------------------------------
Summary: Using deprecated Int96 storage for timestamps triggers
integer overflow in some cases
Key: ARROW-5618
URL: https://issues.apache.org/jira/browse/ARROW-5618
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: TP Boudreau
When storing Arrow timestamps in Parquet files using the Int96 storage format,
certain combinations of array lengths and validity bitmasks cause an integer
overflow error on read. It's not immediately clear whether the Arrow/Parquet
writer is storing zeroes when it should be storing positive values or the
reader is attempting to calculate a nanoseconds value inappropriately from
zeroed inputs (perhaps missing the null bit flag). Also not immediately clear
why only certain length columns seem to be affected.
Probably the quickest way to reproduce this undefined behavior is to alter the
existing unit test UseDeprecatedInt96 (in file
.../arrow/cpp/src/parquet/arrow/arrow-reader-writer-test.cc) by quadrupling its
column lengths (repeating the same values), followed by 'make unittest' using
clang-7 with sanitizers enabled. (Here's a patch applicable to current master
that changes the test as described: [1]; I used the following cmake command to
build my environment: [2].) You should get a log something like [3]. If
requested, I'll see if I can put together a stand-alone minimal test case that
induces the behavior.
The quick-hack at [4] will prevent integer overflows, but this is only included
to confirm the proximate cause of the bug: the Julian days field of the Int96
appears to be zero, when a strictly positive number is expected.
I've assigned the issue to myself and I'll start looking into the root cause of
this.
[1] https://gist.github.com/tpboudreau/b6610c13cbfede4d6b171da681d1f94e
[2] https://gist.github.com/tpboudreau/59178ca8cb50a935aab7477805aa32b9
[3] https://gist.github.com/tpboudreau/0c2d0a18960c1aa04c838fa5c2ac7d2d
[4] https://gist.github.com/tpboudreau/0993beb5c8c1488028e76fb2ca179b7f
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)