Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15403 )
Change subject: IMPALA-6505: Min-Max predicate push down in ORC scanner ...................................................................... Patch Set 11: (5 comments) A few comments about timestamps and testing. In case of timestamps I would consider simply skipping them for now, as both testing and implementation are trickier for them compared to other types. http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@958 PS11, Line 958: case TYPE_TIMESTAMP I am unsure whether our logic is correct in case there is timezone conversion during reading. http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@973 PS11, Line 973: nanos Shouldn't this be nanos % NANOS_PER_SEC ? I am also unsure about correctness for timestamps before 1970-01-01. I think that the best solution would be add a new function to TimestampValue like UtcToSecondsAndNanos(int64_t* seconds, int32_t nanos) to concentrate logic like this in one place. Another possibility is to simply remove support for Timestamps for now. The constructor in ORC code for reference: https://github.com/apache/orc/blob/main/c++/include/orc/sargs/Literal.hh#L100 http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@991 PS11, Line 991: case TYPE_CHAR: { Can you add some comment / possibly DCHECK for CHAR/VARCHAR, as these are skipped by the frontend? http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test File testdata/workloads/functional-query/queries/QueryTest/orc-stats.test: http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test@1 PS11, Line 1: ==== About these tests in general: I think that we should have both a "negative" and a "positive" test for all types, so one that can skip the whole stripe, and one that cannot. Preferably the would be to test around both min and max value, e.g. <=min_val, <min_val, >=max_val, >max_val to check the edges cases. http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test@257 PS11, Line 257: "2009-01-01 00:00:00" It would be great to test timestamps before 1970-01-01 too, but alltypessmall doesn't contain values like that. Maybe create a small test table for that? I am also missing a test "about not having false negatives", e.g. an = predicate with a value that exists in the table. -- To view, visit http://gerrit.cloudera.org:8080/15403 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I136622413db21e0941d238ab6aeea901a6464845 Gerrit-Change-Number: 15403 Gerrit-PatchSet: 11 Gerrit-Owner: Norbert Luksa <[email protected]> Gerrit-Reviewer: Anonymous Coward (520) Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Norbert Luksa <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 07 Sep 2021 13:44:41 +0000 Gerrit-HasComments: Yes
