[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318446#comment-16318446
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 2/13/18 10:44 AM:
------------------------------------------------------------------

I think it is worth going through an example. Here is the timestamp 
'2000-01-01 12:34:56', stored as an int96 and dumped with parquet-tools:
{noformat}
$ parquet-tools dump hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd00000000_1937418586_data.0.parq | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320
{noformat}
Since 117253024523396126668760320 = 0x60FD4B3229000059682500, the 12 bytes 
(with the leading zero byte restored) are 
00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the 
time and the date parts.
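
The byte split above can be reproduced with a few lines of Python. This is 
only an illustrative sketch: the variable names are mine, and it assumes (as 
the hex expansion above suggests) that parquet-tools printed the 12 stored 
bytes as a single big-endian integer.

```python
# Decimal value printed by parquet-tools for the example timestamp.
dumped = 117253024523396126668760320

# An INT96 occupies 12 bytes. Assumption: the dump shows those bytes
# interpreted as one big-endian number, so converting back with
# byteorder="big" recovers the stored byte sequence.
raw = dumped.to_bytes(12, byteorder="big")

# First 8 bytes: time of day; last 4 bytes: date.
time_part, date_part = raw[:8], raw[8:]

print(time_part.hex(" ").upper())
print(date_part.hex(" ").upper())
```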

00 60 FD 4B 32 29 00 00 is the time part. Reversing the bytes gives 
0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 
minutes + 56 seconds.

59 68 25 00 is the date part. Reversing the bytes gives 0x00256859 = 
2451545 as the Julian day number, which [corresponds 
to|http://aa.usno.navy.mil/jdconverter?ID=AA&jd=2451545] 2000-01-01.
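
The decoding of both parts can be sketched in Python as follows. The names 
are hypothetical, and the conversion relies on one fact not stated above: 
Julian day number 2440588 corresponds to 1970-01-01, which is used here to 
turn the Julian day into a calendar date.

```python
import struct
from datetime import datetime, timedelta

# The 12 bytes from the example: 8 bytes of time followed by 4 bytes of date.
raw = bytes.fromhex("0060FD4B32290000" "59682500")

# "Reversing the bytes" of each part is the same as reading it little-endian:
# an unsigned 64-bit nanoseconds-of-day value, then an unsigned 32-bit Julian day.
nanos_of_day, julian_day = struct.unpack("<QI", raw)

# Assumption for illustration: Julian day number of 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588

timestamp = datetime(1970, 1, 1) + timedelta(
    days=julian_day - JULIAN_DAY_OF_UNIX_EPOCH,
    microseconds=nanos_of_day // 1000,  # datetime has microsecond resolution
)

print(nanos_of_day, julian_day, timestamp)
```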

For ordering based purely on numerical value to be correct, comparisons 
should interpret the example above not as 0x0060FD4B3229000059682500 = 
117253024523396126668760320 (as is currently done), but as 
0x00256859000029324BFD6000 = 45223023200227578716446720. However, since we 
do not want to introduce a new comparison order for a different endianness, we 
should just deprecate the ordering for this type.
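
To make the ordering problem concrete, here is a small Python sketch that 
compares the example timestamp with one from the following midnight. The 
helper name encode_int96 is hypothetical, and the sketch assumes the current 
comparison treats the stored bytes as an unsigned big-endian number.

```python
import struct

def encode_int96(julian_day, nanos_of_day):
    # Hypothetical helper: 8 little-endian bytes of nanoseconds-of-day
    # followed by 4 little-endian bytes of Julian day, as in the example.
    return struct.pack("<QI", nanos_of_day, julian_day)

earlier = encode_int96(2451545, 45296 * 10**9)  # 2000-01-01 12:34:56
later = encode_int96(2451546, 0)                # 2000-01-02 00:00:00

# Assumed current interpretation: the 12 bytes read as one big-endian number.
current_earlier = int.from_bytes(earlier, "big")
current_later = int.from_bytes(later, "big")

# Interpretation that would sort correctly: all 12 bytes read little-endian,
# which puts the Julian day in the most significant position.
fixed_earlier = int.from_bytes(earlier, "little")
fixed_later = int.from_bytes(later, "little")

print(current_earlier < current_later)  # False: the later timestamp sorts first
print(fixed_earlier < fixed_later)      # True: chronological order
```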


> Deprecate type-defined sort ordering for INT96 type
> ---------------------------------------------------
>
>                 Key: PARQUET-1065
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1065
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Zoltan Ivanfi
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>             Fix For: 1.10.0
>
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
