Re: [Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

2022-12-07 Thread Micah Kornfield
https://github.com/apache/parquet-format/pull/185 has been merged.

On Fri, Nov 4, 2022 at 9:54 PM Micah Kornfield wrote:

> A new proposal for adding a logical annotation to support Float16 values
> [1]  reopened the discussion on specifying how parquet should deal with
> edge cases for floating point types (PARQUET-1222 [2]).
>
> To resolve this, the consensus from the JIRA is not to specify an
> ordering when writing, but rather to specify only rules for reading
> data. The rules were already present in the parquet.thrift file [3].
> They are:
>
>> - If the min is a NaN, it should be ignored.
>> - If the max is a NaN, it should be ignored.
>> - If the min is +0, the row group may contain -0 values as well.
>> - If the max is -0, the row group may contain +0 values as well.
>> - When looking for NaN values, min and max should be ignored.
>
>
> I've created a PR [4] to update README.md in parquet-format that:
> 1.  Specifies statistics should not be used when a column has an unknown
> logical type since correct comparisons cannot be performed.
> 2.  Specifies the ordering for primitive types and references the
> parquet.thrift for the details on how to handle floating point values.
>
> Feedback and other ideas are welcome.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/parquet-format/pull/184
> [2] https://issues.apache.org/jira/browse/PARQUET-1222
> [3]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L897
> [4] https://github.com/apache/parquet-format/pull/185
>
>


[Format] Clarifying Sort Order Requirements for Floating Points and Logical Types

2022-11-04 Thread Micah Kornfield
A new proposal for adding a logical annotation to support Float16 values
[1]  reopened the discussion on specifying how parquet should deal with
edge cases for floating point types (PARQUET-1222 [2]).

To resolve this, the consensus from the JIRA is not to specify an ordering
when writing, but rather to specify only rules for reading data. The rules
were already present in the parquet.thrift file [3]. They are:

> - If the min is a NaN, it should be ignored.
> - If the max is a NaN, it should be ignored.
> - If the min is +0, the row group may contain -0 values as well.
> - If the max is -0, the row group may contain +0 values as well.
> - When looking for NaN values, min and max should be ignored.
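
As a rough illustration of how a reader might apply these rules, here is a
minimal Python sketch. The function names (usable_bounds, may_contain) are
hypothetical and do not come from parquet.thrift or any Parquet library; the
sketch assumes the statistics have already been decoded into Python floats,
with None meaning "no statistic available".

```python
import math

def usable_bounds(stat_min, stat_max):
    """Return (lower, upper) bounds that are safe to prune with.

    None means "unknown": the bound must not be used for pruning.
    """
    # A NaN min or max carries no ordering information, so ignore it.
    lower = None if stat_min is None or math.isnan(stat_min) else stat_min
    upper = None if stat_max is None or math.isnan(stat_max) else stat_max
    # A min of +0 may hide -0 values, so widen the lower bound to -0.
    if lower is not None and lower == 0.0:
        lower = -0.0
    # A max of -0 may hide +0 values, so widen the upper bound to +0.
    if upper is not None and upper == 0.0:
        upper = 0.0
    return lower, upper

def may_contain(value, stat_min, stat_max):
    """Return False only if the row group provably cannot contain value."""
    # When looking for NaN, min and max say nothing; never prune.
    if math.isnan(value):
        return True
    lower, upper = usable_bounds(stat_min, stat_max)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

Python's own float comparison already treats -0.0 and +0.0 as equal, so the
zero widening only matters if the bounds are later handed to a sign-aware
(total-order or bitwise) comparison; it is kept here to mirror the rules
one-to-one.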


I've created a PR [4] to update README.md in parquet-format that:
1.  Specifies statistics should not be used when a column has an unknown
logical type since correct comparisons cannot be performed.
2.  Specifies the ordering for primitive types and references the
parquet.thrift for the details on how to handle floating point values.
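
Point 1 could be sketched on the reader side roughly as follows. This is only
an illustration: KNOWN_LOGICAL_TYPES and effective_stats are hypothetical
names, the set of type names is made up for the example, and a real reader
would check the logical type annotation it decoded from the file metadata.

```python
# Hypothetical set of logical types this reader knows how to compare.
# None stands for "no logical type", i.e. a plain primitive column.
KNOWN_LOGICAL_TYPES = {None, "STRING", "INT", "DECIMAL", "DATE", "TIMESTAMP"}

def effective_stats(logical_type, stat_min, stat_max):
    """Return the (min, max) a reader may rely on, or (None, None).

    If the logical type is unknown to this reader, the sort order is
    unknown too, so the statistics are dropped rather than trusted.
    """
    if logical_type not in KNOWN_LOGICAL_TYPES:
        return None, None
    return stat_min, stat_max
```

With this sketch, effective_stats("SOME_FUTURE_TYPE", 1, 2) yields
(None, None), so no pruning is attempted for that column.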

Feedback and other ideas are welcome.

Thanks,
Micah

[1] https://github.com/apache/parquet-format/pull/184
[2] https://issues.apache.org/jira/browse/PARQUET-1222
[3]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L897
[4] https://github.com/apache/parquet-format/pull/185