[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318280#comment-16318280
 ] 

Lars Volker commented on PARQUET-1065:
--------------------------------------

My understanding is that primitive types (INT32, INT64) use little-endian 
order. INT96 might do the same, though it's not documented explicitly in 
parquet.thrift. Both fields in INT96 timestamps (time and date) are encoded as 
little endian, too, so interpreting the resulting 12 bytes as an unsigned 
12-byte little-endian integer should give the correct order, no?

An 8-byte timestamp with bytes T0..T7 and a 4-byte date with bytes D0..D3 would 
be stored as in this example. Memory addresses increase to the right; the first 
row shows the bytes of the 12-byte integer in little-endian order:

|I0|I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|
|T0|T1|T2|T3|T4|T5|T6|T7|D0|D1|D2|D3|

Comparing the resulting timestamp as an int96 would compare the most 
significant byte first, which is stored at the highest address (I11, D3). 
Logically, this will compare by date first, then by timestamp.
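
To make the argument concrete, here is a minimal Python sketch of the layout 
described above. The helper names (encode_int96, as_unsigned_le) and the sample 
field values are hypothetical and chosen only for illustration; the sketch 
assumes both fields are non-negative and says nothing about how any Parquet 
implementation actually compares INT96 values.

```python
import struct


def encode_int96(time_part: int, date_part: int) -> bytes:
    """Pack an 8-byte time field followed by a 4-byte date field.

    Both fields are little endian, matching the T0..T7 | D0..D3 layout.
    Assumption: both fields are non-negative (packed as unsigned).
    """
    return struct.pack("<QI", time_part, date_part)


def as_unsigned_le(raw: bytes) -> int:
    """Interpret the 12 bytes as one unsigned little-endian integer."""
    return int.from_bytes(raw, byteorder="little", signed=False)


# Hypothetical values: a large time on an earlier date,
# and a tiny time on the following date.
earlier = encode_int96(time_part=86_399_000_000_000, date_part=2_458_000)
later = encode_int96(time_part=1, date_part=2_458_001)

# The date occupies the most significant bytes (I8..I11), so it dominates.
assert as_unsigned_le(earlier) < as_unsigned_le(later)

# With equal dates, the time field decides the order.
morning = encode_int96(time_part=10, date_part=2_458_000)
evening = encode_int96(time_part=20, date_part=2_458_000)
assert as_unsigned_le(morning) < as_unsigned_le(evening)

# Equivalent view: the integer is (date << 64) | time.
assert as_unsigned_le(later) == (2_458_001 << 64) | 1
```

Under these assumptions the unsigned little-endian interpretation orders 
values by date first and by time second, which is the point made above.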

[~zi] - Am I missing something?

> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



