[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318436#comment-16318436
 ] 

Zoltan Ivanfi commented on PARQUET-1065:
----------------------------------------

The Parquet specification does not talk about endianness (which is something 
that I think should be addressed), but it defines data in terms of Thrift 
structures and the language bindings (at least parquet-mr) directly use these 
Thrift structures for reading and writing. Based on the Thrift specification 
(and some actual data files as well), these Thrift structures have a big-endian 
byte order. To quote from the [Integer 
encoding|https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md#integer-encoding]
 section of the Thrift specification:

{quote}In the binary protocol integers are encoded with the most significant 
byte first (big endian byte order, aka network order). An int8 needs 1 byte, an 
int16 2, an int32 4 and an int64 needs 8 bytes.{quote}

However, please note that there is no int96 type here, so its byte order really 
should be specified in Parquet Format. Given that all other int types have a 
big-endian byte order, I don't think any other choice would make sense for 
int96. (Parquet-tools already interprets int96 values according to this 
ordering.) Impala, however, simply writes the 12 bytes of its little-endian 
in-memory representation into the consecutive bytes of an int96, so the values 
are meaningless for less-than or greater-than comparisons.
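
To illustrate the point about byte order, here is a minimal Python sketch. The 
helper names (thrift_i32, int96_big_endian, int96_little_endian) are 
hypothetical and are not taken from Thrift, parquet-mr, parquet-tools or 
Impala. The sketch only assumes that a reader compares the 12 bytes of an int96 
in storage order, as it would for a big-endian value.

```python
import struct


def thrift_i32(value: int) -> bytes:
    """Encode an int32 most significant byte first, as the Thrift binary
    protocol section quoted above describes."""
    return struct.pack(">i", value)


def int96_big_endian(value: int) -> bytes:
    """Hypothetical helper: 12-byte big-endian layout, consistent with the
    other integer types."""
    return value.to_bytes(12, "big", signed=True)


def int96_little_endian(value: int) -> bytes:
    """Hypothetical helper: 12-byte little-endian layout, standing in for an
    in-memory representation copied out byte by byte."""
    return value.to_bytes(12, "little", signed=True)


small, large = 1, 256

print(thrift_i32(1).hex())  # 00000001

# Big-endian layout: comparing the stored bytes agrees with the numeric order.
print(int96_big_endian(small) < int96_big_endian(large))  # True

# Little-endian layout: the same byte-wise comparison gives the opposite result.
print(int96_little_endian(small) < int96_little_endian(large))  # False
```

With the little-endian layout, 1 is stored as 01 00 00 ... and 256 as 
00 01 00 ..., so a reader that treats the first byte as the most significant 
one concludes that 1 is greater than 256.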

> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)
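
A small Python sketch of the disagreement described in the issue. The function 
names are hypothetical, and decoding the 12 bytes as big-endian is an 
assumption made only for this illustration. It shows that signed and unsigned 
interpretations can order the same pair of values differently, while the 
min == max check does not depend on any ordering.

```python
def as_signed(raw: bytes) -> int:
    """Interpret 12 bytes as a signed integer (big-endian is assumed here)."""
    return int.from_bytes(raw, "big", signed=True)


def as_unsigned(raw: bytes) -> int:
    """Interpret the same 12 bytes as an unsigned integer."""
    return int.from_bytes(raw, "big", signed=False)


def all_values_equal(min_raw: bytes, max_raw: bytes) -> bool:
    """The special case from the issue: min == max means every value in the
    column chunk is the same, whatever the sort order is."""
    return min_raw == max_raw


a = b"\xff" * 12             # -1 when signed, 2**96 - 1 when unsigned
b = b"\x00" * 11 + b"\x01"   # 1 under both interpretations

print(as_signed(a) < as_signed(b))      # True: signed ordering puts a first
print(as_unsigned(a) < as_unsigned(b))  # False: unsigned ordering puts b first
print(all_values_equal(a, a))           # True
```

A writer using one ordering and a reader using the other would therefore 
disagree about which value is the minimum and which is the maximum.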



