[Impala-ASF-CR] IMPALA-6505: Min-Max predicate push down in ORC scanner

Quanlong Huang (Code Review) Wed, 15 Sep 2021 20:03:52 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15403 )


Change subject: IMPALA-6505: Min-Max predicate push down in ORC scanner
......................................................................


Patch Set 12:

(8 comments)

> Patch Set 11:
>
> (5 comments)
>
> A few comments about timestamps and testing.
>
> In case of timestamps I would consider simply skipping them for now, as  both 
> testing and implementation are trickier for them compared to other types.

Thank Csaba and Qifan for taking another look! I remove the logics of timestamp 
predicates and will deal with them in IMPALA-10915. Since this is a revived 
work of Norbert and he recently doesn't have time on this, I suggest defering 
optimizations/refactoring in follow-up patches and let's focus on bug-fixing in 
this patch. Thanks!

http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@958
PS11, Line 958:   const ColumnType& typ
> I am unsure whether our logic is correct in case there is timezone conversi
Let's follow up this in IMPALA-10915.


http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@973
PS11, Line 973:
> Shouldn't this be nanos % NANOS_PER_SEC ?
Oops! yeah, it should be nanos % NANOS_PER_SEC. Let me remove Timestamp in this 
patch and fully test it in a follow-up JIRA: IMPALA-10915.


http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@991
PS11, Line 991:       return orc::Lit
> Can you add some comment / possibly DCHECK for CHAR/VARCHAR, as these are s
Done


http://gerrit.cloudera.org:8080/#/c/15403/11/be/src/exec/hdfs-orc-scanner.cc@1053
PS11, Line 1053:   const TupleDescriptor* min_max_tuple_desc = 
scan_node_->min_max_tuple_desc();
               :   if (!min_max_tuple_desc) return Status::OK();
               :   // TODO(IMPALA-10894): pushing down predicates into the ORC 
reader wi
> Since HdfsOrcScanner operates at the thread level, the work here (all the way 
> to populate sarg) can be moved to the scan node level. That is, we populate 
> sarg once and allow all scanners to read it and create the push-down 
> predicate within each.

The problem is that building 'sarg' uses the column ids in ORC files which can 
differ from file to file (due to table schema evolution). Different scanner may 
read different ORC files so requires a different sarg. We need to check whether 
all ORC files read by a scan node have the same file-level schema (column ids), 
which needs some refactoring of the current framework. We might not benifit a 
lot for this optimization since this is not in the hot path.

> To check literals, can we use min_max_conjuncts_ (std::vector<ScalarExpr*>) 
> directly, without the need to populate min_max_conjunct_evals_?

I think yes. Since the parquet scanner also uses this pattern, can we leave 
this optimization in later patches? So we can refactor them together and keep 
this patch simple.


http://gerrit.cloudera.org:8080/#/c/15403/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/15403/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@575
PS11, Line 575:   addMinMaxOriginalConjunct(slotDesc.getParent(), binaryPred);
              :       buildStatsPredicate(analyzer, slotRef, binaryPred, 
binaryPred.getOp());
              :     } else if 
(BinaryPredicate.IS_EQ_PREDICATE.apply(binaryPred)) {
              :       addMinMaxOriginalConjunct(slotDesc.getParent(), 
binaryPred);
              :       // TODO: this could be optimized for boolean columns.
              :
> Optimization: we may keep EQ operator as is for ORC, since ORC supports EQU
Yeah, this will be done in the next patch: https://gerrit.cloudera.org/c/17815
The current patch is for reviving Norbert's work. So I hope we can add new 
items/improvements in follow-up patches.


http://gerrit.cloudera.org:8080/#/c/15403/11/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@581
PS11, Line 581:       buildStatsPredicate(analyzer, slotRef, binaryPred, 
BinaryPredicate.Operator.GE);
> Future optimization: we may also push-down IsNullPredicate.
Yeah, same as the above reply.


http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test
File testdata/workloads/functional-query/queries/QueryTest/orc-stats.test:

http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test@1
PS11, Line 1: ====
> About these tests in general:
Thanks for the suggestion! I do find an issue (IMPALA-10916) after adding more 
tests.


http://gerrit.cloudera.org:8080/#/c/15403/11/testdata/workloads/functional-query/queries/QueryTest/orc-stats.test@257
PS11, Line 257:
> It would be great to test timestamps before 1970-01-01 too, but alltypessma
Let's follow up this in IMPALA-10915.



--
To view, visit http://gerrit.cloudera.org:8080/15403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Gerrit-Change-Number: 15403
Gerrit-PatchSet: 12
Gerrit-Owner: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Anonymous Coward (520)
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 16 Sep 2021 03:03:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6505: Min-Max predicate push down in ORC scanner

Reply via email to