[ 
https://issues.apache.org/jira/browse/ARROW-6149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Felton updated ARROW-6149:
---------------------------------
    Description: 
The [Parquet Format 
specification|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
 says

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However, the C++ Parquet code does not follow this. 16-byte decimal comparison 
is implemented using a lexicographical comparison of signed chars.

This appears to be because the function at 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
 selects the comparison from the sort_order (signed) and physical_type 
(FIXED_LENGTH_BYTE_ARRAY) alone; there is no override for decimal.
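
To illustrate the difference, here is a minimal sketch of the two orderings. It is not the Arrow implementation: the function names LessSignedBytes and LessDecimalSpec are hypothetical, and it assumes the values are big-endian two's-complement byte strings of equal length, as the Parquet format stores fixed-length decimals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an Arrow function): lexicographical comparison
// that treats every byte as signed, i.e. the behaviour described above.
bool LessSignedBytes(const uint8_t* a, const uint8_t* b, std::size_t n) {
  return std::lexicographical_compare(
      a, a + n, b, b + n, [](uint8_t x, uint8_t y) {
        return static_cast<int8_t>(x) < static_cast<int8_t>(y);
      });
}

// Hypothetical helper (not an Arrow function): the ordering the spec
// describes. Flip the most-significant bit of the first byte, then compare
// all bytes as unsigned.
bool LessDecimalSpec(const uint8_t* a, const uint8_t* b, std::size_t n) {
  if (n == 0) return false;
  const uint8_t a0 = static_cast<uint8_t>(a[0] ^ 0x80);
  const uint8_t b0 = static_cast<uint8_t>(b[0] ^ 0x80);
  if (a0 != b0) return a0 < b0;
  // Remaining bytes carry no sign information: plain unsigned comparison.
  return std::lexicographical_compare(a + 1, a + n, b + 1, b + n);
}
```

For example, with 2-byte values, 128 is encoded as \{0x00, 0x80\} and 1 as \{0x00, 0x01\}. The signed-byte comparison reads 0x80 as -128 and so orders 128 before 1, whereas the spec's ordering correctly places 1 before 128. Only bytes after the first are affected, which is why the problem shows up once a later byte has its high bit set.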

  was:
The 
[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md|Parquet 
Format specifications] says

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However this isn't followed in the C++ Parquet code. 16-byte decimal comparison 
is implemented using a lexicographical comparison of signed chars.

This appears to be because the function 
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
 just goes off the sort_order (signed) and physical_type 
(FIXED_LENGTH_BYTE_ARRAY), there is no override for decimal.


> [Parquet] Decimal comparisons used for min/max statistics are not correct
> -------------------------------------------------------------------------
>
>                 Key: ARROW-6149
>                 URL: https://issues.apache.org/jira/browse/ARROW-6149
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Philip Felton
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
