Philip Felton created ARROW-6149:
------------------------------------
Summary: [Parquet] Decimal comparisons used for min/max statistics
are not correct
Key: ARROW-6149
URL: https://issues.apache.org/jira/browse/ARROW-6149
Project: Apache Arrow
Issue Type: Bug
Reporter: Philip Felton
The
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md|Parquet
Format specification] says:
bq. If the column uses int32 or int64 physical types, then signed comparison of
the integer values produces the correct ordering. If the physical type is
fixed, then the correct ordering can be produced by flipping the
most-significant bit in the first byte and then using unsigned byte-wise
comparison.
However, the C++ Parquet code does not follow this. 16-byte decimal comparison
is implemented as a lexicographical comparison of signed chars.
This appears to be because the function
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
only looks at the sort_order (signed) and the physical_type
(FIXED_LENGTH_BYTE_ARRAY); there is no override for decimal.
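To illustrate the difference, here is a minimal standalone sketch. It is not the Arrow code: Decimal16, Encode, LessSignedChars and LessPerSpec are hypothetical names made up for this example, and it assumes the decimal is stored as 16 bytes of big-endian two's complement. It contrasts a lexicographical comparison of signed chars with the comparison described in the spec quote above.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical type for illustration: a 16-byte big-endian two's complement value.
using Decimal16 = std::array<uint8_t, 16>;

// Encode a 64-bit integer as a sign-extended 16-byte big-endian value.
Decimal16 Encode(int64_t v) {
  Decimal16 out;
  out.fill(v < 0 ? 0xFF : 0x00);  // sign extension for the upper bytes
  const uint64_t u = static_cast<uint64_t>(v);
  for (int i = 0; i < 8; ++i) {
    out[15 - i] = static_cast<uint8_t>(u >> (8 * i));
  }
  return out;
}

// The behaviour described in this issue: every byte is compared as a signed char.
bool LessSignedChars(const Decimal16& a, const Decimal16& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](uint8_t x, uint8_t y) {
        return static_cast<int8_t>(x) < static_cast<int8_t>(y);
      });
}

// The ordering from the spec: flip the most-significant bit of the first
// byte, then compare all bytes as unsigned.
bool LessPerSpec(Decimal16 a, Decimal16 b) {
  a[0] ^= 0x80;
  b[0] ^= 0x80;
  return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}
```

With these helpers, 1 and 255 differ only in the last byte (0x01 versus 0xFF). LessSignedChars treats 0xFF as -1 and therefore orders 255 before 1, while LessPerSpec orders 1 before 255. Both agree when the values differ in the first byte, for example -1 versus 1.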