[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644456#comment-17644456
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341349709

   > I don't feel it would require a formal vote. In my view it was more a 
   > missing part of the spec that is made clear.
   > It might be worth a heads up on the dev list, though.
   
   Replied to my original thread that this has been merged.




> Specify a well-defined sorting order for float and double types
> ---
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-format
> Reporter: Zoltan Ivanfi
> Assignee: Micah Kornfield
> Priority: Critical
> Fix For: format-2.10.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than +0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C++ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
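
The corner cases named in the description can be reproduced in a few lines. This is only an illustrative Python sketch of IEEE 754 comparison behaviour; it is not taken from any Parquet implementation, and all variable names are made up for the example.

```python
import math

nan = float("nan")

# IEEE 754 comparison treats the two zeros as equal ...
zeros_equal = (-0.0 == 0.0)
neg_zero_less = (-0.0 < 0.0)
# ... even though they are distinct values with different sign bits.
neg_zero_sign = math.copysign(1.0, -0.0)

# <, > and == all evaluate to false when one operand is NaN,
# including NaN == NaN.
nan_comparisons = [nan < 1.0, nan > 1.0, nan == nan]

# As a result, a naive min depends on the order in which values are seen:
# Python's min() keeps the first item unless a later one compares smaller.
min_nan_first = min(nan, 1.0)
min_nan_last = min(1.0, nan)
```

With a comparison-based min or max, the same set of values can therefore produce different statistics depending on where the NaN appears, which is the interoperability hazard the issue describes.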



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1766#comment-1766
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

gszadovszky commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341331959

   I don't feel it would require a formal vote. In my view it was more a 
missing part of the spec that is made clear. 
   It might be worth a heads up on the dev list, though.






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1762#comment-1762
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

pitrou commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1341327375

   > @pitrou thanks for merging, I'm not sure if we needed an official vote on this?
   
   Oops, that's a good question. I might have merged too quickly. @gszadovszky 
What is your take on this?
   






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644198#comment-17644198
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

pitrou commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340540614

   Thanks @emkornfield !






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644197#comment-17644197
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

pitrou merged PR #185:
URL: https://github.com/apache/parquet-format/pull/185






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644149#comment-17644149
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1340395605

   @gszadovszky @pitrou thanks for the review.  I believe I incorporated the 
remaining feedback.






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644148#comment-17644148
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

emkornfield commented on code in PR #185:
URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041775669


##
README.md:
##
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+annotated with an unknown logical type, statistics may not be used
+for pruning data. The sort order for logical types is documented in
+the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:
+
+* BOOLEAN - false, true
+* INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
+  not totally ordered due to special case like NaN. They require special
+  handling when reading statistics. The details are documented in parquet.thrift in the
+  `ColumnOrder` union. They are summarized 

Review Comment:
   done.







[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643902#comment-17643902
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

pitrou commented on code in PR #185:
URL: https://github.com/apache/parquet-format/pull/185#discussion_r1041067303


##
README.md:
##
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:

Review Comment:
   ```suggestion
   Parquet stores min/max statistics at several levels (such as Column Chunk,
   Column Index and Data Page). Comparison for values of a type obey the
   following rules:
   ```



##
README.md:
##
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+annotated with an unknown logical type, statistics may not be used
+for pruning data. The sort order for logical types is documented in
+the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:
+
+* BOOLEAN - false, true
+* INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are
+  not totally ordered due to special case like NaN. They require special
+  handling when reading statistics. The details are documented in parquet.thrift in the
+  `ColumnOrder` union. They are summarized 

Review Comment:
   Need to synchronize this with the final wording from `parquet.thrift`.



##
README.md:
##
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+annotated with an unknown logical type, statistics may not be used
+for pruning data. The sort order for logical types is documented in
+the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:

Review Comment:
   ```suggestion
   2.  For primitive types, the following rules apply:
   ```



##
src/main/thrift/parquet.thrift:
##
@@ -902,6 +902,13 @@ union ColumnOrder {
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
+   * 
+   * When writing statistics the following rules should be followed:
+   * - NaNs should not be written to min or max statistics fields.
+   * - Only -0 should be written into min statistics fields (if only 
+   *   +0 is present in the column it should be converted to -0.0).
+   * - Only +0 should be written into a max statistics fields (if 
+   *   only -0 is present it must be convereted to +0).

Review Comment:
   Suggestion to make wording clearer.
   ```suggestion
  * - If the computed max value is zero (whether negative or positive),
  *   `+0.0` should be written into the max statistics field.
  * - If the computed min value is zero (whether negative or positive),
  *   `-0.0` should be written into the min statistics field.
   ```
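
The writer-side rules discussed in this comment could be sketched as follows. This is a hedged Python illustration, not code from parquet-mr, Arrow, or the PR itself: the helper name `float_min_max` is hypothetical, and returning `None` when a column holds only NaNs (meaning "write no min/max") is an assumption of this sketch.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Compute (min, max) statistics for a FLOAT/DOUBLE column.

    Hypothetical helper: returns None when there is no non-NaN value,
    which this sketch treats as "do not write min/max at all".
    """
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        if math.isnan(v):
            # NaNs must never end up in the min or max fields.
            continue
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    if lo is None or hi is None:
        return None
    # -0.0 == 0.0 under IEEE 754 comparison, so each test below matches
    # a zero of either sign and then normalizes it.
    if lo == 0.0:
        lo = -0.0  # a zero min is always written as -0.0
    if hi == 0.0:
        hi = 0.0   # a zero max is always written as +0.0
    return lo, hi
```

For example, `float_min_max([0.0, float("nan"), 1.5])` yields a min of `-0.0` and a max of `1.5`, so a reader comparing against the min cannot wrongly exclude a `-0.0` value.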



##
README.md:
##
@@ -144,6 +144,38 @@ documented in [LogicalTypes.md][logical-types].
 
 [logical-types]: LogicalTypes.md
 
+### Sort Order
+
+Parquet stores min/max statistics at several levels (e.g. RowGroup, Page Index,
+etc). Comparison for values of a type follow the following logic:
+
+1.  Each logical type has a specified comparison order. If a column is
+annotated with an unknown logical type, statistics may not be used
+for pruning data. The sort order for logical types is documented in
+the [LogicalTypes.md][logical-types] page.
+2.  For primitives the following sort orders apply:
+
+* BOOLEAN - false, true
+* INT32, INT64, FLOAT, DOUBLE - Signed comparison. Floating point values are

Review Comment:
   I would suggest making a separate list item for floating-point:
   ```suggestion
   * INT32, INT64 - Signed comparison.
   * FLOAT, DOUBLE - Signed comparison with special handling of NaNs
     and signed zeros. The details are documented in...
   ```
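
On the reader side, the "special handling of NaNs and signed zeros" might look roughly like the Python sketch below. The function `may_contain` and its signature are hypothetical. Treating a NaN found in a stored min or max as "statistics unusable" is a conservative assumption of this sketch, not something stated in this thread.

```python
import math
from typing import Optional


def may_contain(target: float,
                stat_min: Optional[float],
                stat_max: Optional[float]) -> bool:
    """Return False only when min/max prove `target` is absent.

    Hypothetical pruning check for an equality predicate on a
    FLOAT/DOUBLE column.
    """
    if stat_min is None or stat_max is None:
        return True  # no statistics: cannot prune
    if math.isnan(target):
        # min and max say nothing about NaNs, so they are ignored.
        return True
    if math.isnan(stat_min) or math.isnan(stat_max):
        # Assumption: NaN in the statistics makes them unusable.
        return True
    # Plain IEEE 754 comparison treats -0.0 and +0.0 as equal, so a min
    # of +0 still admits -0 and a max of -0 still admits +0.
    return stat_min <= target <= stat_max
```

Using ordinary comparison operators here, rather than a bitwise or total ordering, is what keeps the check safe for files whose writers stored either sign of zero.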






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-12-05 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643692#comment-17643692
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1338880938

   @gszadovszky thanks for the feedback. I tried to address it in the latest 
commit.






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-11-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17630594#comment-17630594
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

emkornfield commented on PR #185:
URL: https://github.com/apache/parquet-format/pull/185#issuecomment-1307760898

   @pitrou @gszadovszky 






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-11-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17629251#comment-17629251
 ] 

ASF GitHub Bot commented on PARQUET-1222:
-

emkornfield opened a new pull request, #185:
URL: https://github.com/apache/parquet-format/pull/185

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   
   






[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-10-10 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614907#comment-17614907
 ] 

Gabor Szadovszky commented on PARQUET-1222:
---

[~emkornfield],

There are a couple of docs in the parquet-format repo. The related ones are 
[about logical 
types|[https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]] 
and the main one that contains the description of the [primitive 
types|https://github.com/apache/parquet-format/blob/master/README.md#types]. 
Unfortunately, the latter one does not contain anything about sorting order.
So, I think, we need to do the following:
* Define the sorting order for the primitive types or reference the logical 
types description for it. (In most cases it would be referencing since the 
ordering depends on the related logical types e.g. signed/unsigned sorting of 
integral types)
* After defining the sorting order of the primitive floating point numbers 
based on what we've discussed above, reference it from the new half-precision 
FP logical type.

(Another unfortunate thing is that we have some specification-like docs at the 
[parquet site|https://parquet.apache.org] as well. I think we should propagate 
the parquet-format docs to there automatically or simply link them from the 
site. But it is clearly a different topic.)



[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-10-08 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17614581#comment-17614581
 ] 

Micah Kornfield commented on PARQUET-1222:
--

Elevating the specification level seems fine.  I was under the impression the 
thrift file was the specification?  Where do we need to do the PR to elevate 
them?



[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611445#comment-17611445
 ] 

Antoine Pitrou commented on PARQUET-1222:
-

I agree with [~gszadovszky] about elevating these rules to the specification 
level.



[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-30 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611444#comment-17611444
 ] 

Antoine Pitrou commented on PARQUET-1222:
-

(side note: the ML is mostly a firehose of notifications nowadays, which 
doesn't make it easy to follow...)



[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611398#comment-17611398
 ] 

Gabor Szadovszky commented on PARQUET-1222:
---

[~emkornfield], I think we do not need to handle NaN values with a boolean to 
fix this issue. NaN is somewhat similar to null values, so we could even count 
them instead of having a boolean, but this question is not closely related to 
this topic.
What do you think about elevating the current suggestion in the thrift file to 
specification level for writing/reading FP min/max values?
{quote}Because the sorting order is not specified properly for floating point 
values (relations vs. total ordering) the following compatibility rules should 
be applied when reading statistics:
* If the min is a NaN, it should be ignored.
* If the max is a NaN, it should be ignored.
* If the min is +0, the row group may contain -0 values as well.
* If the max is -0, the row group may contain +0 values as well.
* When looking for NaN values, min and max should be ignored.{quote}
For writing, we should skip NaN values and use -0 for min and +0 for max 
whenever a zero is to be taken into account.
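
The writing rule above could be sketched as follows in Python. The function name `write_min_max` is hypothetical, and returning `None` when only NaNs are present is an assumption made for this sketch; the comment itself does not say what to write in that case.

```python
import math
from typing import Iterable, Optional, Tuple


def write_min_max(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Compute (min, max) statistics for a page or column chunk.

    NaNs are skipped. A zero bound is normalised to -0.0 for min and
    +0.0 for max, so readers using either zero ordering stay correct.
    Returns None when no non-NaN value exists (assumption of this sketch).
    """
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None
    lo, hi = min(non_nan), max(non_nan)
    if lo == 0.0:  # true for both -0.0 and +0.0
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi
```

For example, `write_min_max([0.0, 1.0, float("nan")])` yields a min of -0.0 and a max of 1.0, even though the data holds only a positive zero.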

With this solution we cannot do anything clever when searching for a NaN, 
but that can be fixed separately. We also need to double-check whether we 
really ignore the min/max stats when searching for a NaN.
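
On the read side, the quoted compatibility rules might look like the sketch below. `may_contain` is a hypothetical helper, not code from parquet-mr or parquet-cpp; it answers whether a row group can be skipped when filtering for a single value.

```python
import math
from typing import Optional


def may_contain(min_stat: Optional[float],
                max_stat: Optional[float],
                target: float) -> bool:
    """Return False only if the row group provably cannot contain target."""
    if math.isnan(target):
        # When looking for NaN values, min and max are ignored.
        return True
    # A NaN bound carries no information, so treat it as missing.
    lo = None if min_stat is None or math.isnan(min_stat) else min_stat
    hi = None if max_stat is None or math.isnan(max_stat) else max_stat
    # IEEE 754 comparison treats -0.0 and +0.0 as equal, so a min of +0
    # never excludes -0 here and a max of -0 never excludes +0. A reader
    # comparing with a total order would need to widen zero bounds itself.
    if lo is not None and target < lo:
        return False
    if hi is not None and target > hi:
        return False
    return True
```

With these rules a NaN search can never prune a row group, which is the limitation the comment mentions.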

I think it is a good idea to discuss such topics on the mailing list. However, 
we should also time-box the discussion and go forward with a proposed solution 
if there is no interest on the mailing list. (Personally, I do not follow the 
dev list anymore.)




[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2022-09-29 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17611356#comment-17611356
 ] 

Micah Kornfield commented on PARQUET-1222:
--

I'd propose the following "fix":
- Add a new optional bool field, "contains_nan", to the statistics struct. When 
it is unset, I think we specify that the semantics of comparisons involving 
-0.0/+0.0, NaN, etc. are not well defined and that implementations have taken 
different routes.
- When it is set, true means the column contains at least one NaN and false 
means no NaNs are present. Further, when set, it implies the following 
ordering: NaNs are never included in the Min/Max statistics in the struct, and 
-0.0 and +0.0 are considered two distinct values that are ordered according to 
sign.

Thoughts?  Should I bring this up on the mailing list?
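
To make the proposal above concrete, here is a minimal sketch in Python. `FloatStatistics`, `build_statistics` and `_signed_key` are hypothetical names for illustration only; the real change would be a field in the thrift Statistics struct, and only the name "contains_nan" comes from the proposal.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FloatStatistics:
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    contains_nan: Optional[bool] = None  # None models the "unset" case


def _signed_key(x: float):
    # Order by value first; the sign breaks the tie between -0.0 and +0.0
    # so that -0.0 sorts before +0.0.
    return (x, math.copysign(1.0, x))


def build_statistics(values: Iterable[float]) -> FloatStatistics:
    """Writer-side sketch: NaNs never enter min/max, zeros keep their sign."""
    values = list(values)
    non_nan = [v for v in values if not math.isnan(v)]
    stats = FloatStatistics(contains_nan=len(non_nan) != len(values))
    if non_nan:
        stats.min_value = min(non_nan, key=_signed_key)
        stats.max_value = max(non_nan, key=_signed_key)
    return stats
```

A reader that sees `contains_nan` left unset would fall back to treating the min/max semantics as not well defined.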



[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2021-04-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316469#comment-17316469
 ] 

Antoine Pitrou commented on PARQUET-1222:
-

Some answers after looking through the code:
* parquet-cpp neither reads nor writes ColumnIndex
* our handling of min_value and max_value on the read path is naive. We use the 
same comparisons regardless of whether ColumnOrder is present or not. In 
particular, we use the native type-specific greater-or-equal comparison (e.g. 
floating-point comparison), which is bound to fail with NaNs (but will succeed 
with signed zeros).
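
The effect of such a naive check can be illustrated with a short Python sketch. `naive_may_contain` is a hypothetical stand-in that uses only native comparisons; it is not the parquet-cpp code.

```python
nan = float("nan")


def naive_may_contain(min_value: float, max_value: float, target: float) -> bool:
    # Native floating-point comparisons only, no special cases.
    return min_value <= target <= max_value


# A NaN bound makes the comparison false for every target, so a row group
# that does contain 2.0 would be skipped.
print(naive_may_contain(1.0, nan, 2.0))   # False

# Signed zeros compare equal, so a min of +0.0 does not exclude -0.0.
print(naive_may_contain(0.0, 1.0, -0.0))  # True
```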






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2021-04-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316180#comment-17316180
 ] 

Gabor Szadovszky commented on PARQUET-1222:
---

[~apitrou], I guess what you've described is the write path of the statistics. 
Because you cannot control other writers I would suggest following the [spec 
for the read 
path|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L892-L899].
Meanwhile, I've done some investigation in the parquet-mr code and the format 
and there are issues related to this topic.
* We have created the ColumnOrder object and the related field in the format to 
specify the ordering of the columns and to prepare for a potential solution 
to this (and similar) issues. We reference this field in the Statistics 
object used for row-group level stats. However, we do not reference it in 
the column indexes. So, for column indexes it is not clear which sort order 
we want to use or how to handle cases like this. How is it implemented in 
parquet-cpp?
* Based on the referenced workaround we handle the special floating point 
values at row-group level in parquet-mr but only for the read path. For the 
write path we still write these values.
* For column indexes we handle these values but only for the write path and not 
for the read path. 

So, we have a couple of issues around this topic and it would be great to 
have a final and well-defined solution for it.



[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2021-04-04 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314456#comment-17314456
 ] 

Antoine Pitrou commented on PARQUET-1222:
-

I'll note that Parquet C++ now has the following behaviour:

* signed zeros are properly ordered (ARROW-5562)
* NaNs are ignored when computing min/max (PARQUET-1225); if a page or column 
chunk only has NaNs, the statistics are unset




[jira] [Commented] (PARQUET-1222) Specify a well-defined sorting order for float and double types

2018-03-28 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417460#comment-16417460
 ] 

Zoltan Ivanfi commented on PARQUET-1222:


I updated this JIRA to distinguish it from PARQUET-1251. To summarize:
 * PARQUET-1251 is a "hotfix" that describes a workaround for handling 
statistics written using the ambiguous specification.
 * This JIRA is about specifying a well-defined sort order.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)