Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15403 )

Change subject: IMPALA-6505: Min-Max predicate push down in ORC scanner
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15403/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15403/1//COMMIT_MSG@19
PS1, Line 19:  - Run scanner tests on ORC files.
Some ideas for more tests:

https://github.com/apache/impala/blob/master/tests/query_test/test_parquet_stats.py

https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index.test

Checking if the filtering is too strict is simple, as that leads to having less 
rows in the results. To check whether is actually works we relied on the 
profile, e.g. by checking NumStatsFilteredPages counter. Adding these counters 
is harder as ORC is a separate lib, but we could probably rely on others, e.g. 
number of rows we had to scan (before evaluating per-row filters).


http://gerrit.cloudera.org:8080/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@875
PS1, Line 875: UtcToUnixTime
According to ORC spec the timestamp stats are stored as millisecs (which is 
quite surprising, as the data itself is stored as secs + nanos). I looked 
around in ORC code and I didn't see any conversion before comparing value to 
min/max from stats. This is something that should be documented in the ORC 
interface ( 
https://github.com/apache/impala/blob/master/be/src/exec/orc-column-readers.cc#L250
 ) as this is far from being intuitive.

So the function to use here is FloorUtcToUnixTimeMillis() instead of 
UtcToUnixTime().

The "floor" part points to another issue here: in a lossy/rounding conversion 
as this, we have to round the MIN limit towards negative infinity, while round 
the MAX limit towards positive infinity. For example assume that all values are 
0.5 ms (from UNIX epoch) the predicate is checking < 0.6 ms. A naive solution 
would round both MIN and MAX to 0 ms. Even if the stat writer was smart and 
saved 0 ms as minimumUtc and 1 ms as maximumUtc in the Orc file, the < 0 ms 
predicate would be evaluated to false.

It would be probably the best to push this logic into the ORC lib: Literal 
could get full nanosec precision timestamps and convert the min/max stats from 
the ORC file to nanosec before using them. I didn't check whether the ORC 
writers round the max stats differently than the min stats - if not, (and there 
are files where both were rounded in the same direction), then the logic has to 
be even more tolerant.


http://gerrit.cloudera.org:8080/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@959
PS1, Line 959:     orc::Literal literal =
             :         GetLiteralSearchArguments(eval, slot_desc->type(), 
&predicate_type);
             :
             :     if (fn_name == "lt") {
This logic is not enough if there is some mismatch between Impala's and ORC's 
representation, as we may have to widen the range we check, so lower/raise the 
literal depending on whether it is a lower or upper bound.

An example is varchar (also explained in a previous comment at line 888):
e.g if 's' is VARCHAR(1) and the predicate is 'c>="b" and c<="d", then passing 
"b" as lower bound is ok, but passing "d" as  upper bound is not, a "da" will 
fit into the range after truncation but it is < than "d". A solution is to pass 
"e" here as upper bound, which may lead to some false positives but not false 
negatives.

Another example is the rounded timestamp stat, see comment at line 875.



--
To view, visit http://gerrit.cloudera.org:8080/15403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Gerrit-Change-Number: 15403
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Norbert Luksa <[email protected]>
Gerrit-Comment-Date: Wed, 11 Mar 2020 20:42:21 +0000
Gerrit-HasComments: Yes

Reply via email to