[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 2/13/18 10:41 AM:
------------------------------------------------------------------

Unfortunately, since INT96 timestamps are stored in the opposite byte order 
from the one in which INT96 numbers are supposed to be stored, the most 
significant byte of the number interpretation varies wildly, spanning the 
whole range between 0x00 and 0xFF. As a result, when comparing the raw bytes, 
signed and unsigned comparison can lead to different results.
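
The effect can be reproduced with a small Python sketch. It is an illustration 
only, not parquet-mr code. It assumes the common INT96 timestamp layout (8 
bytes of little-endian nanoseconds of day followed by 4 bytes of little-endian 
Julian day) and assumes that the number interpretation treats the first stored 
byte as the most significant one. The names encode_int96_timestamp and 
as_number are hypothetical helpers.

```python
import struct


def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    """Assumed layout: 8-byte little-endian nanos of day, then 4-byte little-endian Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def as_number(raw: bytes, signed: bool) -> int:
    """Read the 12 stored bytes as one 96-bit integer, first stored byte most significant."""
    return int.from_bytes(raw, byteorder="big", signed=signed)


day = 2458163  # arbitrary Julian day number, the same for both values
earlier = encode_int96_timestamp(day, 1)    # first stored byte is 0x01
later = encode_int96_timestamp(day, 128)    # first stored byte is 0x80 (sign bit set)

unsigned_says_earlier_first = as_number(earlier, False) < as_number(later, False)
signed_says_earlier_first = as_number(earlier, True) < as_number(later, True)

print(unsigned_says_earlier_first)  # True
print(signed_says_earlier_first)    # False
```

The two timestamps differ by only 127 ns, yet the signed and unsigned readings 
order them differently because the sign bit of the first stored byte flips.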

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store second precision only. In these cases, the 
least significant byte is always 0x00, so comparison signedness does not affect 
the order. The problem I described above is only present for sub-second 
precisions. (To be more exact, it affects precisions finer than 10^8 ns = 100 
ms, since 10^8 = 2^8 * 5^8 and the 2^8 factor makes the least significant byte 
zero.)
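
The arithmetic can be checked with a short Python sketch. It assumes the same 
layout as the comment implies (nanoseconds of day stored little-endian in the 
first 8 bytes, Julian day in the last 4) and again reads the first stored byte 
as the most significant byte of the number. The helper names are hypothetical.

```python
import struct

# 10^8 = 2^8 * 5^8, so every multiple of 10^8 ns is also a multiple of 256.
assert 10**8 == 2**8 * 5**8


def low_byte(nanos_of_day: int) -> int:
    """Least significant byte of the nanosecond count."""
    return nanos_of_day & 0xFF


def first_stored_byte(nanos_of_day: int, julian_day: int = 2458163) -> int:
    """First byte of the assumed 12-byte encoding (little-endian nanos, then Julian day)."""
    return struct.pack("<qi", nanos_of_day, julian_day)[0]


whole_seconds = [s * 10**9 for s in range(0, 86400, 3600)]  # one value per hour
tenths = [t * 10**8 for t in range(10)]                     # 100 ms steps
hundredths = [h * 10**7 for h in range(10)]                 # 10 ms steps

seconds_low_bytes = {low_byte(n) for n in whole_seconds}
tenths_low_bytes = {low_byte(n) for n in tenths}
hundredths_low_bytes = {low_byte(n) for n in hundredths}

# With the first stored byte equal to 0x00 the sign bit is clear, so the
# signed and unsigned readings of the 12 bytes are the same number.
readings_agree = all(
    int.from_bytes(struct.pack("<qi", n, 2458163), "big", signed=True)
    == int.from_bytes(struct.pack("<qi", n, 2458163), "big", signed=False)
    for n in whole_seconds + tenths
)
```

Under these assumptions, second and 100 ms precision always leave the low byte 
at 0x00, while 10 ms precision already produces 0x80 for odd multiples, which 
sets the sign bit.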


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will vary 
wildly, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store second precision only. In these cases, the 
least significant byte is always 0x00, so comparison signedness does not affect 
the order. The problem I described above is only present for sub-second 
precisions. (To be more exact, it affects precisions finer than 10^8 ns = 100 
ms, since 10^8 = 2^8 * 5^8 and the 2^8 factor makes the least significant byte 
zero.)

> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>             Fix For: 1.10.0
>
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
