[ 
https://issues.apache.org/jira/browse/ORC-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106232#comment-17106232
 ] 

Panagiotis Garefalakis edited comment on ORC-629 at 5/13/20, 12:14 PM:
-----------------------------------------------------------------------

Attaching a patch fixing the ambiguous sorting order of double values as part 
of their ColumnStatisticsImpl (to prevent data loss at the read path when 
filtering) by:
* Ignoring statistics when they contain a NaN value
* Using -0.0 as min value and +0.0 as max value to ensure that no 0.0 values 
would be skipped
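
The two rules above can be sketched roughly as follows. This is only an 
illustration, not the code in the patch: the class DoubleStatsSketch and its 
methods (update, isUsable, mightContain) are hypothetical names and are not 
part of ColumnStatisticsImpl. The sketch assumes a reader that compares bounds 
with Double.compare, where -0.0 sorts before +0.0.

```java
/** Hypothetical min/max accumulator; NOT the actual ORC ColumnStatisticsImpl. */
public class DoubleStatsSketch {
    private boolean hasValues = false;
    private boolean sawNaN = false;
    private double min;
    private double max;

    /** Fold one column value into the running statistics. */
    public void update(double value) {
        if (Double.isNaN(value)) {
            // NaN compares false against everything, so min/max become untrustworthy.
            sawNaN = true;
            return;
        }
        if (!hasValues) {
            hasValues = true;
            min = value;
            max = value;
        } else {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        // -0.0 == 0.0 is true for primitives, so each test matches either zero.
        // Widen a zero bound to the outermost zero: -0.0 for min, +0.0 for max.
        if (min == 0.0) {
            min = -0.0;
        }
        if (max == 0.0) {
            max = 0.0;
        }
    }

    /** Statistics are ignored once any NaN has been seen. */
    public boolean isUsable() {
        return hasValues && !sawNaN;
    }

    /** Conservative check: returns true whenever the statistics cannot rule the literal out. */
    public boolean mightContain(double literal) {
        if (!isUsable()) {
            return true; // cannot skip anything without trustworthy bounds
        }
        // Assumed reader behaviour: bounds are compared with the total order of Double.compare.
        return Double.compare(literal, min) >= 0 && Double.compare(literal, max) <= 0;
    }

    public double getMin() {
        return min;
    }

    public double getMax() {
        return max;
    }
}
```

With this sketch, a column holding only 0.0 ends up with min = -0.0 and 
max = +0.0, so a lookup for either zero is not skipped. Once a NaN is seen, 
mightContain returns true for every literal.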

As a second step I believe we should also define a total sorting order for 
doubles similar to what Java Double comparator does -- that should be respected 
across implementations such as Java/Cpp.
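
For reference, a small sketch of the ordering that Double.compare imposes, 
compared with the primitive operators. The class name DoubleTotalOrderSketch 
and the helper sortedCopy are made up for this example. Double.compare itself 
is the standard JDK method, which places -0.0 before +0.0 and NaN after every 
other value.

```java
import java.util.Arrays;

/** Illustrative only: shows the total order of Double.compare. */
public class DoubleTotalOrderSketch {
    /** Returns a copy of the input sorted with Double.compare. */
    public static Double[] sortedCopy(Double... values) {
        Double[] copy = Arrays.copyOf(values, values.length);
        Arrays.sort(copy, Double::compare);
        return copy;
    }

    public static void main(String[] args) {
        // Primitive comparisons against NaN are always false.
        System.out.println(1.0 < Double.NaN);  // false
        System.out.println(1.0 > Double.NaN);  // false
        // Double.compare gives a definite answer: 1.0 sorts before NaN.
        System.out.println(Double.compare(1.0, Double.NaN) < 0);  // true
        // Prints [-Infinity, -0.0, 0.0, 1.0, NaN]
        System.out.println(Arrays.toString(
            sortedCopy(Double.NaN, 0.0, -0.0, 1.0, Double.NEGATIVE_INFINITY)));
    }
}
```

Whether other implementations should adopt exactly this order is the open 
question; the sketch only shows what the Java side already provides.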


was (Author: pgaref):
Attaching a patch fixing the ambiguous sorting order of double values as part 
of their ColumnStatisticsImpl (to prevent data loss at the read path when 
filtering) by:
* Ignoring statistics when they contain a NaN value
* Using -0.0 as min value and +0.0 as max value to ensure that no 0.0 values 
would be skipped

As a second step I believe we should also define a sorting order for doubles 
similar to what Java Double comparator does -- that should be respected across 
implementations such as Java/Cpp.

> PPD: Floating point NaN is not transitive across comparisons
> ------------------------------------------------------------
>
>                 Key: ORC-629
>                 URL: https://issues.apache.org/jira/browse/ORC-629
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Gopal Vijayaraghavan
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>
> Range comparisons don't work right for columns which start with Double.NaN as 
> the first row (min == max == NaN). 
> 1 < NaN is false.
> 1 > NaN is false.
> {code}
> File Version: 0.12 with ORC_135
> Rows: 3
> Compression: ZLIB
> Compression size: 32768
> Type: struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<c:double>>
> Stripe Statistics:
>   Stripe 1:
>     Column 0: count: 3 hasNull: false
>     Column 1: count: 3 hasNull: false bytesOnDisk: 5 min: 0 max: 0 sum: 0
>     Column 2: count: 3 hasNull: false bytesOnDisk: 5 min: 1 max: 1 sum: 3
>     Column 3: count: 3 hasNull: false bytesOnDisk: 8 min: 536870912 max: 536870912 sum: 1610612736
>     Column 4: count: 3 hasNull: false bytesOnDisk: 7 min: 0 max: 2 sum: 3
>     Column 5: count: 3 hasNull: false bytesOnDisk: 5 min: 1 max: 1 sum: 3
>     Column 6: count: 3 hasNull: false
>     Column 7: count: 3 hasNull: false bytesOnDisk: 19 min: NaN max: NaN sum: NaN
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
