[ 
https://issues.apache.org/jira/browse/IMPALA-8034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760889#comment-16760889
 ] 

ASF subversion and git services commented on IMPALA-8034:
---------------------------------------------------------

Commit b08c8e3db2e769609f47d5c0ed87c547d41d1c8b in impala's branch 
refs/heads/master from paul-rogers
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b08c8e3 ]

IMPALA-8034: Improve planner tests

The FE PlannerTest cases are good, but often unrealistic and overly
complicated, especially when trying to verify selectivity and
cardinality. This commit adds new tests that isolate each bit of the
work for detailed inspection.

The current version of the tests highlights a number of bugs to be
fixed by ongoing work. This commit establishes a clear baseline of
current behavior, even if that behavior is not quite right. A "Bug:"
comment explains the expected result.

Tests: These are tests; no production code was changed.

Change-Id: I40e59e08d7ddf2b0391d42e50511aaf95d7275f4
Reviewed-on: http://gerrit.cloudera.org:8080/12145
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> PlannerTest cardinality tests are not realistic
> -----------------------------------------------
>
>                 Key: IMPALA-8034
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8034
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> Impala generally assumes that queries are M:1, joined on the FK/PK. A PK 
> uniquely identifies a row, so {{|pk1| = |Table|}}. This assumption is built 
> into join estimation, along with the assumption that columns are 
> independent, so if we have multiple keys, 
> {{|pk1| * |pk2| * … * |pkn| = |Table|}}.
> However, PlannerTest frequently uses non-independent, non-unique columns. It 
> might join on both the (unique) {{id}} column and the non-unique 
> {{int_col}} column, which throws off the calculations. For example:
> {noformat}
> select *
> from functional.alltypesagg a
> full outer join functional.alltypessmall b using (id, int_col)
> right join functional.alltypesaggnonulls c
>   on (a.id = c.id and b.string_col = c.string_col)
> {noformat}
> If we then try to get the estimated cardinalities to match the actual 
> cardinalities obtained from running the query, we end up fighting our 
> assumptions. This shows up in the code: rather than use the classic 
> assumption that the key columns are independent, the code uses special 
> adjustments for redundant columns, perhaps so that tests such as the above 
> produce good estimates.
> Better to modify (or add) tests that are based on our assumptions so we can 
> verify that the intended logic works. It is fine to then add a few “oddball” 
> queries to see how well the estimates hold up when the data (or user) does 
> not follow the independence assumption.
> Alternatively, add new tests that use realistic joins and retain the 
> existing tests, adding a note explaining why the resulting cardinality 
> estimates appear wrong (because we are using unrealistic, redundant columns 
> in joins, which real users seldom do).
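
To make the independence assumption in the description concrete, here is a minimal sketch of the textbook join estimate, {{|L| * |R| / max(ndv(L.key), ndv(R.key))}}, with the NDV of a compound key taken as the product of the per-column NDVs capped at the table's row count. This is not Impala's planner code: the function names, row counts and NDVs below are all made up for illustration.

```python
from math import prod


def combined_ndv(ndvs, row_count):
    """NDV of a compound key, assuming independent columns.

    The product of per-column NDVs is capped at the row count, since a
    table cannot hold more distinct key values than it has rows.
    """
    return min(prod(ndvs), row_count)


def join_cardinality(left_rows, right_rows, left_ndvs, right_ndvs):
    """Textbook equi-join estimate: |L| * |R| / max(ndv(L.key), ndv(R.key))."""
    left_key_ndv = combined_ndv(left_ndvs, left_rows)
    right_key_ndv = combined_ndv(right_ndvs, right_rows)
    return left_rows * right_rows / max(left_key_ndv, right_key_ndv)


# Hypothetical M:1 join: 1000 detail rows reference 100 master rows by id.
fk_pk = join_cardinality(1000, 100, [100], [100])

# Same join with a redundant second key column (NDV 10) that is fully
# determined by id. The true result size is unchanged, but the estimate
# treats the extra column as an independent filter.
redundant = join_cardinality(1000, 100, [100, 10], [100, 10])

print(fk_pk, redundant)
```

With these made-up numbers the plain FK/PK join is estimated at 1000 rows, which matches the M:1 expectation, while adding the redundant column drops the estimate to 100 rows. This is the kind of mismatch that appears when a test joins on both {{id}} and {{int_col}}.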



