[ 
https://issues.apache.org/jira/browse/ORC-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley resolved ORC-611.
-------------------------------
    Fix Version/s: 1.7.0
                   1.6.4
       Resolution: Fixed

I just committed this. Thank you, Panos!

> Incorrect min-max stats for sub-millisecond timestamps
> ------------------------------------------------------
>
>                 Key: ORC-611
>                 URL: https://issues.apache.org/jira/browse/ORC-611
>             Project: ORC
>          Issue Type: Bug
>          Components: C++, Java
>            Reporter: Csaba Ringhofer
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>             Fix For: 1.6.4, 1.7.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The issue is related to the precision of storing timestamps:
> - nanoseconds for the data itself
> - only milliseconds for min-max statistics
> Both min and max are rounded to the same value, while min should be rounded 
> down and max should be rounded up to ensure that the values are actually 
> within that range.
> Repro in Hive:
> {code}
> create table tsstat (ts timestamp) stored as orc;
> insert into tsstat values ("1970-01-01 00:00:00.0005");
> select * from tsstat where ts = "1970-01-01 00:00:00.0005";
> -- returned 0 rows
> {code}
> Both the Java and the C++ writers have this issue (thanks [~stigahuang] for 
> looking them up):
> https://github.com/apache/orc/blob/fea154436c37c81a16b13d879b510096cfaa2946/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L108
> https://github.com/apache/orc/blob/fea154436c37c81a16b13d879b510096cfaa2946/c%2B%2B/src/ColumnWriter.cc#L1800
> I guess that there are already files with this issue in production, so I 
> think that the only way to fix this is to hack the reader:
> - decrease the min and increase the max stats by 1 ms after reading them 
> from the file
> - also be careful about the values pushed down, as the same precision loss 
> can occur there too, e.g. "WHERE ts < '1970-01-01 00:00:00.0005' AND ts > 
> '1970-01-01 00:00:00.0004'" shouldn't be converted to ts < "1970-01-01" AND 
> ts > "1970-01-01"
> The issue was discovered during an Impala review: 
> https://gerrit.cloudera.org/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@875

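For illustration, the rounding rule from the description (min rounded down, max
rounded up) could look roughly like the sketch below. This is not the committed
ORC patch: the class and method names are hypothetical, and it assumes a
timestamp is given as epoch seconds plus a nanosecond field in [0, 999999999].

```java
// Hypothetical helper, not part of the ORC API.
final class TimestampStatsRounding {
    private static final int NANOS_PER_MILLI = 1_000_000;

    /** Largest millisecond value <= the timestamp; usable as a min statistic. */
    static long floorMillis(long epochSeconds, int nanosOfSecond) {
        // Assumption: nanosOfSecond is never negative, so integer division
        // already rounds down, even for timestamps before the epoch.
        return Math.addExact(Math.multiplyExact(epochSeconds, 1000L),
                             nanosOfSecond / NANOS_PER_MILLI);
    }

    /** Smallest millisecond value >= the timestamp; usable as a max statistic. */
    static long ceilMillis(long epochSeconds, int nanosOfSecond) {
        long floor = floorMillis(epochSeconds, nanosOfSecond);
        // Bump by one millisecond only when sub-millisecond digits are present.
        return nanosOfSecond % NANOS_PER_MILLI == 0 ? floor : Math.addExact(floor, 1L);
    }
}
```

With the repro value 1970-01-01 00:00:00.0005 (0 seconds, 500000 nanoseconds)
this gives min = 0 ms and max = 1 ms, so the stored range encloses the value.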

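The reader-side workaround suggested in the description (widen the stored
min/max by 1 ms and avoid truncating the pushed-down literal) might be sketched
as follows. Again, the names are hypothetical and this is only an assumption
about how such a check could be written, not how the ORC or Impala readers
implement it.

```java
// Hypothetical helper, not part of the ORC API.
final class LegacyTimestampStats {
    private static final int NANOS_PER_MILLI = 1_000_000;

    /**
     * Returns false only when an equality predicate on the given timestamp
     * (epoch seconds plus nanos in [0, 999999999]) cannot match any row
     * covered by millisecond min/max stats of unknown rounding direction.
     */
    static boolean mightContain(long statMinMillis, long statMaxMillis,
                                long epochSeconds, int nanosOfSecond) {
        // Widen the stored range by 1 ms on each side to undo the precision loss.
        long lo = Math.subtractExact(statMinMillis, 1L);
        long hi = Math.addExact(statMaxMillis, 1L);

        // Keep the literal's sub-millisecond part by bracketing it with its
        // floor and ceiling instead of collapsing it to a single millisecond.
        long valueFloor = Math.addExact(Math.multiplyExact(epochSeconds, 1000L),
                                        nanosOfSecond / NANOS_PER_MILLI);
        long valueCeil = nanosOfSecond % NANOS_PER_MILLI == 0
                ? valueFloor : Math.addExact(valueFloor, 1L);

        return valueCeil >= lo && valueFloor <= hi;
    }
}
```

The check is deliberately conservative: it may read a stripe that turns out to
have no match, but it should not skip one that holds the value.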

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
