[
https://issues.apache.org/jira/browse/IMPALA-8021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753259#comment-16753259
]
ASF subversion and git services commented on IMPALA-8021:
---------------------------------------------------------
Commit dccb97ba795f58d76d5e0e664685ec52754f059f in impala's branch
refs/heads/master from paul-rogers
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dccb97b ]
IMPALA-8058: Fallback for HBase key scan range estimation
Impala supports "pushing" of HBase key range predicates to HBase so that
Impala reads only rows within the target key range. The planner
estimates the cardinality of such scans by sampling the rows within the
range. However, we have seen cases where sampling returns rows for
unknown reasons. The planner then ends up without a good cardinality
estimate. (Specifically, the code does a division by zero and produces
a huge estimate. See the ticket for details.)
Impala appears to use the sampling strategy to compute cardinality
because HBase uses generally do not gather table stats. The resulting
estimates are often off by 2x or more. This is a problem in tests as it
causes cardinality numbers to vary greatly from the expected values.
Fortunately, tests do gather HMS stats. There may be cases where users
do as well. This fix exploits that fact.
This fix:
* Creates a fall-back strategy that uses table cardinality from HMS and
the selectivity of the key predicates to estimate cardinality when the
sampling approach fails.
* The fall-back strategy requires tracking the predicates used for HBase
keys so that their selectivity can be applied during fall-back
calculations.
* Moved HBase key calculation out of the SingleNodePlanner into the
HBase scan node as suggested by a "TO DO" in the code. Doing so
simplified the new code.
* In the spirit of IMPALA-7919, adds the key predicates to the HBase
scan node in the EXPLAIN output.
Testing:
* Adds a query context option to disable the normal key sampling to
force the use of the fall-back. Used for testing.
* Adds a new set of HBase test cases that use the new feature to check
plans with the fall-back approach.
* Reran all existing tests.
* Compared cardinality numbers for the two modes: sampling and HMS using
the cardinality features of IMPALA-8021. The two approaches provide
different results, but this is mostly due to the missing selectivity
estimates for inequality operators. (That's a fix for another time.)
Change-Id: Ic01147abcb6b184071ba28b55aedc3bc49b322ce
Reviewed-on: http://gerrit.cloudera.org:8080/12192
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Add estimated cardinality to EXPLAIN output
> -------------------------------------------
>
> Key: IMPALA-8021
> URL: https://issues.apache.org/jira/browse/IMPALA-8021
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
>
> The EXPLAIN output provides much useful information in the plan tree. All our
> planning decisions are based on cardinality; but it appears in the EXPLAIN
> output only for the EXTENDED level. The profile only contains the plan from
> the STANDARD level. This change proposes to include row size and cardinality
> even in the STANDARD level.
> The nodes that have the information call it "cardinality", so continue to use
> that term.
> Add cardinality to each node so it appears something like this:
> {noformat}
> HASH JOIN [INNER JOIN, BROADCAST]
> | row-size=89B cardinality=1.23G
> |
> |--SCAN HDFS [db.table]
> | partitions=2/123 files=2 size=4.56MB row-size=89B cardinality=7.89M
> {noformat}
> Cardinality should appear in all levels above MINIMAL. Cardinality is not
> needed for EXCHANGE since it can be inferred from other nodes.
> Also, the existing code prints large cardinalities in detail: 1234567890,
> which is hard to read. Use the abbreviated output, using metric (power of
> 1000) units, so 1.23G instead.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]