[ https://issues.apache.org/jira/browse/IMPALA-8058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753257#comment-16753257 ]
ASF subversion and git services commented on IMPALA-8058:
---------------------------------------------------------
Commit dccb97ba795f58d76d5e0e664685ec52754f059f in impala's branch
refs/heads/master from paul-rogers
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dccb97b ]
IMPALA-8058: Fallback for HBase key scan range estimation
Impala supports "pushing" of HBase key range predicates to HBase so that
Impala reads only rows within the target key range. The planner
estimates the cardinality of such scans by sampling the rows within the
range. However, we have seen cases where sampling returns no rows for
unknown reasons. The planner then ends up without a good cardinality
estimate. (Specifically, the code does a division by zero and produces
a huge estimate. See the ticket for details.)
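
To make the failure mode concrete, here is a minimal Java sketch of the
arithmetic the ticket describes. The class and method names are
hypothetical and are not taken from the Impala code base; only the
arithmetic (a double division followed by a cast to long) follows the
ticket.

```java
// Illustration only: names are hypothetical, not Impala's actual code.
final class DivideByZeroCardinality {
    // Region size divided by average row width, as described in the ticket.
    static long naiveEstimate(long regionSizeBytes, double avgRowWidthBytes) {
        // Floating-point division by zero does not throw; it yields Infinity
        // (or NaN when the numerator is also zero).
        double rows = regionSizeBytes / avgRowWidthBytes;
        // The narrowing cast saturates: +Infinity becomes Long.MAX_VALUE,
        // and NaN becomes 0.
        return (long) rows;
    }

    public static void main(String[] args) {
        System.out.println(naiveEstimate(1_000_000L, 100.0)); // prints 10000
        System.out.println(naiveEstimate(1_000_000L, 0.0));   // prints 9223372036854775807
        System.out.println(naiveEstimate(0L, 0.0));           // prints 0
    }
}
```

Because neither the division nor the cast raises an error, the bogus
value flows silently into the plan unless the zero-width case is checked
explicitly.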
Impala appears to use the sampling strategy to compute cardinality
because HBase users generally do not gather table stats. The resulting
estimates are often off by 2x or more. This is a problem in tests as it
causes cardinality numbers to vary greatly from the expected values.
Fortunately, tests do gather HMS stats. There may be cases where users
do as well. This fix exploits that fact.
This fix:
* Creates a fall-back strategy that uses table cardinality from HMS and
the selectivity of the key predicates to estimate cardinality when the
sampling approach fails.
* Tracks the predicates used for HBase keys so that their selectivity
can be applied during fall-back calculations.
* Moves HBase key calculation out of the SingleNodePlanner into the
HBase scan node as suggested by a "TODO" in the code. Doing so
simplified the new code.
* In the spirit of IMPALA-7919, adds the key predicates to the HBase
scan node in the EXPLAIN output.
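
The fall-back described above can be sketched roughly as follows. This
is not the code from the commit: the class, method, and parameter names
are hypothetical, the use of -1 for "unknown" is an assumption, and
multiplying the per-predicate selectivities assumes the key predicates
are independent.

```java
// Sketch under stated assumptions; not the actual Impala implementation.
final class HBaseCardinalityFallback {
    // Assumed convention for this sketch: -1 means "no estimate available".
    static final long UNKNOWN = -1;

    static long estimate(long scanBytes, double avgRowWidthBytes,
                         long hmsRowCount, double... keySelectivities) {
        // Normal path: sampling produced rows, so the average width is usable.
        if (avgRowWidthBytes > 0) {
            return Math.round(scanBytes / avgRowWidthBytes);
        }
        // Sampling failed and HMS has no row count: nothing to fall back on.
        if (hmsRowCount < 0) {
            return UNKNOWN;
        }
        // Fall-back: scale the HMS table row count by the combined
        // selectivity of the key predicates (independence assumed).
        double selectivity = 1.0;
        for (double s : keySelectivities) {
            selectivity *= s;
        }
        return Math.round(hmsRowCount * selectivity);
    }

    public static void main(String[] args) {
        System.out.println(estimate(1_000_000L, 100.0, 10_000L, 0.01)); // sampling path
        System.out.println(estimate(1_000_000L, 0.0, 10_000L, 0.01));   // fall-back path
        System.out.println(estimate(1_000_000L, 0.0, -1L, 0.01));       // no stats at all
    }
}
```

The point of the sketch is the ordering: sampling is preferred when it
works, and the HMS-based estimate is used only when the sample is empty.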
Testing:
* Adds a query context option to disable the normal key sampling to
force the use of the fall-back. Used for testing.
* Adds a new set of HBase test cases that use the new feature to check
plans with the fall-back approach.
* Reran all existing tests.
* Compared cardinality numbers for the two modes: sampling and HMS using
the cardinality features of IMPALA-8021. The two approaches provide
different results, but this is mostly due to the missing selectivity
estimates for inequality operators. (That's a fix for another time.)
Change-Id: Ic01147abcb6b184071ba28b55aedc3bc49b322ce
Reviewed-on: http://gerrit.cloudera.org:8080/12192
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> HBase scan cardinality division-by-zero leads to bogus cardinality
> ------------------------------------------------------------------
>
> Key: IMPALA-8058
> URL: https://issues.apache.org/jira/browse/IMPALA-8058
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> A particular HBase query has highly selective key filters and runs into code
> bugs that produce a bogus, huge cardinality value.
> {{HbaseScanNode.computeStats()}} attempts to compute table cardinality by
> calling {{HBaseTable.getEstimatedRowStats()}}. This then calls into (in the
> latest versions) {{FeHBaseTable.getEstimatedRowStats()}}.
> This code tries to estimate cardinality by:
> * Scanning a set of regions.
> * Getting the size of each region.
> * Averaging over a sample of rows to estimate row width.
> Once we know the size of the regions we need to scan, and the average row
> width, we can compute the scan cardinality.
> The problem in this particular query is that the predicates are so selective
> that no regions match. As a result, the average row width is zero. We divide
> (as a double) the region size by 0 and get INF. We cast that to a long and
> get Long.MAX_VALUE. We then use that as our (highly bogus) cardinality
> estimate.
> The code must:
> * Detect the division-by-zero (no sample rows) case.
> * Use an alternative estimate (such as multiplying total table row count from
> HMS by the filter selectivity).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)