[
https://issues.apache.org/jira/browse/IMPALA-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers resolved IMPALA-8034.
---------------------------------
Resolution: Fixed
> PlannerTest cardinality tests are not realistic
> -----------------------------------------------
>
> Key: IMPALA-8034
> URL: https://issues.apache.org/jira/browse/IMPALA-8034
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
>
> Impala generally assumes that joins are M:1, joined on the FK/PK. A PK
> uniquely identifies a row, so {{|pk1| = |Table|}}. Join estimation also
> has a built-in assumption that columns are independent, so if we have
> multiple keys, {{|pk1| * |pk2| * … * |pkn| = |Table|}}.
> But PlannerTest frequently uses non-independent, non-unique columns. It
> might join on both the (unique) {{id}} column and the non-unique
> {{int_col}} column, which throws off calculations. For example:
> {noformat}
> select *
> from functional.alltypesagg a
> full outer join functional.alltypessmall b using (id, int_col)
> right join functional.alltypesaggnonulls c on (a.id = c.id and b.string_col = c.string_col)
> {noformat}
> If we then try to get the estimated cardinalities to match the actual
> cardinalities obtained from running the query, we end up fighting our
> assumptions. This shows up in the code: rather than use the classic
> assumption that the key columns are independent, the code uses special
> adjustments for redundant columns, perhaps so that tests such as the above
> produce good estimates.
> Better to modify (or add) tests that are based on our assumptions so we can
> verify that the intended logic works. It is fine to then add a few “oddball”
> queries to see how well the estimates hold up when the data (or user) does
> not follow the independence assumption.
> Alternatively, add new tests that use realistic joins, and retain the
> existing tests, adding a note of explanation why the resulting cardinality
> estimates appear wrong (because we are using unrealistic, redundant columns
> in joins, which real users seldom do).
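
The effect described above can be illustrated with a small sketch. It uses the textbook join estimate |L| * |R| / max(NDV(L.key), NDV(R.key)) and treats the NDV of a compound key as the product of the per-column NDVs, which is the independence assumption. This is not Impala's planner code: the function names and all row counts and NDVs below are hypothetical, chosen only to show why a redundant join column pushes the estimate down.

```python
from math import prod


def compound_key_ndv(column_ndvs):
    """NDV of a multi-column key, assuming the columns are independent."""
    return prod(column_ndvs)


def estimate_join_rows(left_rows, right_rows, left_key_ndvs, right_key_ndvs):
    """Textbook equi-join estimate: |L| * |R| / max(NDV(L.key), NDV(R.key)).

    Illustrative only; this is not the formula as implemented in Impala.
    """
    left_ndv = compound_key_ndv(left_key_ndvs)
    right_ndv = compound_key_ndv(right_key_ndvs)
    return left_rows * right_rows / max(left_ndv, right_ndv)


# Hypothetical sizes: a 1,000,000-row detail table and a 10,000-row master.
DETAIL_ROWS = 1_000_000
MASTER_ROWS = 10_000

# Realistic M:1 join on a compound PK whose column NDVs multiply out to
# the master row count (100 * 100 = 10,000), as the assumption expects.
realistic = estimate_join_rows(DETAIL_ROWS, MASTER_ROWS, [100, 100], [100, 100])

# Unrealistic join on a unique id plus a redundant, non-unique column.
# The product of NDVs (10,000 * 100) now exceeds the master row count,
# so the estimate drops by the NDV of the redundant column.
redundant = estimate_join_rows(DETAIL_ROWS, MASTER_ROWS, [10_000, 100], [10_000, 100])

print(realistic)  # one output row per detail row
print(redundant)  # 100x lower than the realistic estimate
```

In the first case the estimate equals the detail row count, which is what an M:1 join should produce. In the second case the only change is the redundant column, yet the estimate falls by a factor of 100, which is the kind of mismatch that special-case adjustments in the planner would have to paper over.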