Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15403 )

Change subject: IMPALA-6505: Min-Max predicate push down in ORC scanner
......................................................................


Patch Set 11:

(5 comments)

A few comments about timestamps and testing.

In case of timestamps I would consider simply skipping them for now, as  both 
testing and implementation are trickier for them compared to other types.

http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@958
PS11, Line 958:     case TYPE_TIMESTAMP
I am unsure whether our logic is correct in case there is timezone conversion 
during reading.


http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@973
PS11, Line 973: nanos
Shouldn't this be nanos % NANOS_PER_SEC ?
I am also unsure about correctness for timestamps before 1970-01-01.

I think that the best solution would be add a new function to TimestampValue 
like UtcToSecondsAndNanos(int64_t* seconds, int32_t nanos) to concentrate logic 
like this in one place.

Another possibility is to simply remove support for Timestamps for now.

The constructor in ORC code for reference: 
https://github.com/apache/orc/blob/main/c++/include/orc/sargs/Literal.hh#L100


http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@991
PS11, Line 991:     case TYPE_CHAR: {
Can you add some comment / possibly DCHECK for CHAR/VARCHAR, as these are 
skipped by the frontend?


http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats.test:

http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test@1
PS11, Line 1: ====
About these tests in general:
I think that we should have both a "negative" and a "positive" test for all 
types, so one that can skip the whole stripe, and one that cannot. Preferably 
the would be to test around both min and max value, e.g. <=min_val, <min_val, 
>=max_val, >max_val to check the edges cases.


http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test@257
PS11, Line 257:  "2009-01-01 00:00:00"
It would be great to test timestamps before 1970-01-01 too, but alltypessmall 
doesn't contain values like that. Maybe create a small test table for that?

I am also missing a test "about not having false negatives", e.g. an = 
predicate with a value that exists in the table.



--
To view, visit http://gerrit.cloudera.org:8080/15403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Gerrit-Change-Number: 15403
Gerrit-PatchSet: 11
Gerrit-Owner: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Anonymous Coward (520)
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 07 Sep 2021 13:44:41 +0000
Gerrit-HasComments: Yes

Reply via email to