[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-01-09 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318547#comment-16318547
 ] 

Zoltan Ivanfi commented on PARQUET-1065:


[~lv], we cannot change the ordering for the existing {{min}} and {{max}} 
fields for int96 timestamps, because statistics were already written for them 
according to the wrong byte order.

We do not want to define a new int96 ordering for the new {{min-value}} and 
{{max-value}} fields either, because:

# We cannot distinguish between a timestamp stored in an int96, which requires 
little-endian ordering, and an actual int96, which requires big-endian ordering.
# Introducing little-endian ordering would put an unnecessary burden on the 
implementors for the sake of a legacy type that we would like to get rid of.

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-01-09 Thread Lars Volker (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318529#comment-16318529
 ] 

Lars Volker commented on PARQUET-1065:
--

Thank you [~zi] for the explanation and the example. I got the idea of little 
endian encodings from [here in 
parquet.thrift|https://github.com/apache/parquet-format/blob/a00e770cb301506f6288d11d6532f2635a8cd349/src/main/thrift/parquet.thrift#L400],
 but that refers to the plain encoding.

I'm a bit worried that changing the ordering from "unsigned" to "undefined" 
will not reduce the confusion. Impala (and other engines) will still need to 
support reading the values and also may want to write and read statistics. Can 
we consider changing the ordering to something like "comparison of the 
represented value if used for legacy timestamps, undefined otherwise"?



[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-01-09 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318446#comment-16318446
 ] 

Zoltan Ivanfi commented on PARQUET-1065:


I think it is worth going through an example: the timestamp '2000-01-01 
12:34:56', stored as an int96 and dumped with parquet-tools:

$ parquet-tools dump hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd_1937418586_data.0.parq | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320

Since 117253024523396126668760320 = 0x0060FD4B3229000059682500, the 12 bytes are 
00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the 
time and the date parts.

00 60 FD 4B 32 29 00 00 is the time part. Reversing the bytes gives 
0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 
minutes + 56 seconds.

59 68 25 00 is the date part. Reversing the bytes gives 0x00256859 = 
2451545 as the Julian day number, which corresponds to 2000-01-01.

For correct ordering based purely on numerical value, comparisons should not 
interpret the example above as 0x0060FD4B3229000059682500 = 
117253024523396126668760320, as they currently do, but as 
0x00256859000029324BFD6000 = 45223023200227578716446720 instead. But since we 
do not want to introduce a new little-endian comparison order, we should just 
deprecate the ordering for this type.
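
The arithmetic in this example can be reproduced with a minimal Python sketch. It is not part of parquet-tools or Impala; the variable names are illustrative, and the constant 2440588 (the Julian day number of 1970-01-01) is a well-known value that does not come from this thread.

```python
from datetime import date, timedelta

# The 12 raw bytes from the example, in file order: 8 time bytes, then 4 date bytes.
raw = bytes.fromhex("0060FD4B3229000059682500")

# The value printed by parquet-tools: all 12 bytes read as one big-endian integer.
as_big_endian = int.from_bytes(raw, "big")

# Each part is little-endian on its own.
nanos_of_day = int.from_bytes(raw[:8], "little")
julian_day = int.from_bytes(raw[8:], "little")

# Assumption: Julian day number 2440588 corresponds to 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
day = date(1970, 1, 1) + timedelta(days=julian_day - JULIAN_DAY_OF_UNIX_EPOCH)
time_of_day = timedelta(seconds=nanos_of_day // 10**9)

# A key that sorts chronologically: date in the high 32 bits, time in the low 64 bits.
chronological_key = (julian_day << 64) | nanos_of_day

print(as_big_endian)      # 117253024523396126668760320
print(day, time_of_day)   # 2000-01-01 12:34:56
print(chronological_key)  # 45223023200227578716446720
```

The chronological key is the same number one gets by reading all 12 bytes as a single little-endian integer, which is why a little-endian comparison order would be the one that sorts these timestamps correctly.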



[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-01-09 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318436#comment-16318436
 ] 

Zoltan Ivanfi commented on PARQUET-1065:


The Parquet specification does not talk about endianness (which is something 
that I think should be addressed), but it defines data in terms of Thrift 
structures and the language bindings (at least parquet-mr) directly use these 
Thrift structures for reading and writing. Based on the Thrift specification 
(and some actual data files as well), these Thrift structures have a big-endian 
byte order. To quote from the [Integer 
encoding|https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md#integer-encoding]
 section of the Thrift specification:

{quote}In the binary protocol integers are encoded with the most significant 
byte first (big endian byte order, aka network order). An int8 needs 1 byte, an 
int16 2, an int32 4 and an int64 needs 8 bytes.{quote}

However, please note that there is no int96 type here, so its byte order really 
should be specified in Parquet Format. Given that all other int types have a 
big-endian byte order, I don't think any other choice would make sense for 
int96. (Parquet-tools already interprets int96 values according to this 
ordering.) Impala, however, simply writes the 12 bytes of its little-endian 
in-memory representation into the consecutive bytes of an int96, so the values 
are meaningless for less-than or greater-than comparisons.
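
To illustrate why the byte order matters for comparisons, here is a small sketch. It uses Python's `struct` module rather than Thrift itself, and the sample values are arbitrary; it only shows that big-endian encodings of non-negative integers sort the same way as the numbers do when compared byte by byte, while little-endian encodings do not.

```python
import struct

values = [1, 255, 256, 65536]

# ">q" packs a signed 64-bit integer most significant byte first (big endian),
# "<q" packs it least significant byte first (little endian).
# Python compares bytes objects lexicographically, treating each byte as unsigned.
sorted_by_big_endian = sorted(values, key=lambda v: struct.pack(">q", v))
sorted_by_little_endian = sorted(values, key=lambda v: struct.pack("<q", v))

print(sorted_by_big_endian)     # [1, 255, 256, 65536]
print(sorted_by_little_endian)  # [65536, 256, 1, 255]
```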



[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2018-01-09 Thread Lars Volker (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318280#comment-16318280
 ] 

Lars Volker commented on PARQUET-1065:
--

My understanding is that primitive types (INT32, INT64) use little-endian 
order, INT96 might do the same, though it's not documented explicitly in 
parquet.thrift. Both fields in INT96 timestamps (time and date) are encoded as 
little endian, too, so interpreting the resulting 12 bytes as an unsigned 12 
byte integer stored as little endian should give the correct order, no?

An 8 byte timestamp with bytes T0..T7 and a 4 byte date with bytes D0..D3 would be 
stored as in this example. Memory addresses increase to the right, and the first row 
is a 12 byte integer in little endian order:

|I0|I1|I2|I3|I4|I5|I6|I7|I8|I9|I10|I11|
|T0|T1|T2|T3|T4|T5|T6|T7|D0|D1|D2|D3|

Comparing the resulting timestamp as an int96 would compare the most 
significant byte first, which is stored at the highest address (I11, D3). 
Logically, this will compare by date first, then by timestamp.

[~zi] - Am I missing something?
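
The layout described above can be sketched in Python as follows. `encode_int96_timestamp` is a hypothetical helper written for this illustration, not an Impala or parquet-mr function, and the two sample timestamps are arbitrary.

```python
def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    # Hypothetical helper: bytes T0..T7 (little-endian time), then D0..D3 (little-endian date).
    return nanos_of_day.to_bytes(8, "little") + julian_day.to_bytes(4, "little")

earlier = encode_int96_timestamp(2451545, 45296 * 10**9)  # 2000-01-01 12:34:56
later = encode_int96_timestamp(2451546, 0)                # 2000-01-02 00:00:00

# Read as an unsigned 12 byte little-endian integer, the date bytes are the most
# significant ones, so the comparison goes by date first and then by time.
little_endian_order_ok = int.from_bytes(earlier, "little") < int.from_bytes(later, "little")

# Read as a big-endian integer, the least significant time byte dominates instead.
big_endian_order_ok = int.from_bytes(earlier, "big") < int.from_bytes(later, "big")

print(little_endian_order_ok, big_endian_order_ok)  # True False
```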



[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-17 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16207226#comment-16207226
 ] 

Zoltan Ivanfi commented on PARQUET-1065:


The bytes of a timestamp are stored in the opposite order to the one used for 
comparing int96 values, so comparison won't give correct results. We could 
add the correct byte order to the specification, but that would not work if one 
tried to store actual integers instead of timestamps. We also have to take 
backwards compatibility into consideration.



[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206424#comment-16206424
 ] 

Deepak Majeti commented on PARQUET-1065:


If we treat Int96 as a primitive data type, then we must compare 
Int96 (little-endian) in reverse byte order. Then we will check the most 
significant bits first, correct?

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16205986#comment-16205986
 ] 

Zoltan Ivanfi commented on PARQUET-1065:


Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.
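
A minimal sketch of this effect, assuming the 12 bytes are read as a single big-endian integer and that "signed" means a two's complement interpretation (one possible reading of the comparison, not necessarily what any particular implementation does). `encode_int96_timestamp` is a hypothetical helper for this illustration.

```python
def encode_int96_timestamp(julian_day: int, nanos_of_day: int) -> bytes:
    # Hypothetical helper: 8 little-endian time bytes, then 4 little-endian date bytes.
    return nanos_of_day.to_bytes(8, "little") + julian_day.to_bytes(4, "little")

# Two instants on the same day, one nanosecond apart; only the first stored byte differs.
a = encode_int96_timestamp(2451545, 0x7F)  # first byte is 0x7F
b = encode_int96_timestamp(2451545, 0x80)  # first byte is 0x80, so the sign bit is set

unsigned_says_a_first = (
    int.from_bytes(a, "big", signed=False) < int.from_bytes(b, "big", signed=False)
)
signed_says_a_first = (
    int.from_bytes(a, "big", signed=True) < int.from_bytes(b, "big", signed=True)
)

print(unsigned_says_a_first, signed_says_a_first)  # True False
```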



[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-12 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202552#comment-16202552
 ] 

Deepak Majeti commented on PARQUET-1065:


INT96 timestamps can be sorted using both signed and unsigned sort orders.
The date values are always positive since they are Julian day numbers. 
Therefore, both orders should work.
Discussion on how the values must be compared is here: 
https://github.com/apache/parquet-format/pull/55

