[ 
https://issues.apache.org/jira/browse/IMPALA-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-8045:
--------------------------------
    Description: 
The work to review join cardinality has found some major issues recorded as 
JIRA tickets. This ticket records a number of smaller issues. Some of these 
issues are a bit tricky because they appear only when some of the other issues 
are resolved. Reporting them directly could be misleading.

h4. ScanNode confusion between table and scan input cardinality

The {{ScanNode}} class in the scanner contains an {{inputCardinality_}} field 
used by join calculations as a proxy for the table size. However, the actual 
scan node implementations set the {{inputCardinality_}} to the estimated number 
of rows *read* by the scan, which is useful when understanding the physical 
scan structure. But, for joins, we need the base table cardinality.

For example, the join may use the input cardinality to understand the reduction 
in rows due to filters in order to adjust the NDV of key columns. But, since 
the input cardinality is the scan count, not the table row count, the math does 
not work out.

The solution is to clarify the code to separate the idea of scan count vs. base 
table row count.

h4. Selectivity Confusion

Similarly, each node computes its selectivity. However, the selectivity is only 
for those predicates that will be applied via a projection. Predicates that can 
be applied because of partition pruning (HDFS), key range pruning (HBase) and 
so on do not "count". While this produces accurate execution estimates, it is 
not helpful for join planning.

In join planning, we need to know the number of filtered rows relative to the 
total table cardinality. This allows us to adjust HDV key cardinality in order 
to estimate the number of rows produced by the join.

Using the partial selectivity, or partial input cardinality (above issue) 
causes inaccurate key cardinality adjustments and incorrect join cardinality 
estimates.

h4. Join Node Does not Apply Selectivity from Its Predicates

A join node can have "additional predicates" applied after creating a join row. 
Accurate estimation of join cardinality must include the selectivity from those 
predicates, but is not currently done. Perhaps because such predicates, in the 
current estimation scheme, always produce an estimated selectivity of .1. This 
will be more important as we add more realistic estimates.

h4. Use Double, not Long for Cardinality Values

In scan nodes, row counts can be reasonable numbers and a Java {{long}} is 
fine. But, once one starts computing join cardinalities, values can grow fast, 
especially for cross joins. The code currently has special checks to limit 
products to {{Long.MAX_VALUE}}. While this solves the overflow issue, it has 
undesirable downstream affects. First, it throws of selectivity calculations 
since the reported cardinality is not the real cardinality. Second, it requires 
special math calls whenever we multiply cardinalities.

Much simper to work with a {{double}}. When values get large, the extra 
precision from a integer value is completely lost in the noise of assumptions 
and estimations.

  was:
The work to review join cardinality has found some major issues recorded as 
JIRA tickets. This ticket records a number of smaller issues. Some of these 
issues are a bit tricky because they appear only when some of the other issues 
are resolved. Reporting them directly could be misleading.

h4. ScanNode confusion between table and scan input cardinality

The {{ScanNode}} class in the scanner contains an {{inputCardinality_}} field 
used by join calculations as a proxy for the table size. However, the actual 
scan node implementations set the {{inputCardinality_}} to the estimated number 
of rows *read* by the scan, which is useful when understanding the physical 
scan structure. But, for joins, we need the base table cardinality.

For example, the join may use the input cardinality to understand the reduction 
in rows due to filters in order to adjust the NDV of key columns. But, since 
the input cardinality is the scan count, not the table row count, the math does 
not work out.

The solution is to clarify the code to separate the idea of scan count vs. base 
table row count.

h4. Selectivity Confusion

Similarly, each node computes its selectivity. However, the selectivity is only 
for those predicates that will be applied via a projection. Predicates that can 
be applied because of partition pruning (HDFS), key range pruning (HBase) and 
so on do not "count". While this produces accurate execution estimates, it is 
not helpful for join planning.

In join planning, we need to know the number of filtered rows relative to the 
total table cardinality. This allows us to adjust HDV key cardinality in order 
to estimate the number of rows produced by the join.

Using the partial selectivity, or partial input cardinality (above issue) 
causes inaccurate key cardinality adjustments and incorrect join cardinality 
estimates.

h4. Join Node Does not Apply Selectivity from Its Predicates

A join node can have "additional predicates" applied after creating a join row. 
Accurate estimation of join cardinality must include the selectivity from those 
predicates, but is not currently done. Perhaps because such predicates, in the 
current estimation scheme, always produce an estimated selectivity of .1. This 
will be more important as we add more realistic estimates.


> Rollup of Smaller Join Cardinality Issues
> -----------------------------------------
>
>                 Key: IMPALA-8045
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8045
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> The work to review join cardinality has found some major issues recorded as 
> JIRA tickets. This ticket records a number of smaller issues. Some of these 
> issues are a bit tricky because they appear only when some of the other 
> issues are resolved. Reporting them directly could be misleading.
> h4. ScanNode confusion between table and scan input cardinality
> The {{ScanNode}} class in the scanner contains an {{inputCardinality_}} field 
> used by join calculations as a proxy for the table size. However, the actual 
> scan node implementations set the {{inputCardinality_}} to the estimated 
> number of rows *read* by the scan, which is useful when understanding the 
> physical scan structure. But, for joins, we need the base table cardinality.
> For example, the join may use the input cardinality to understand the 
> reduction in rows due to filters in order to adjust the NDV of key columns. 
> But, since the input cardinality is the scan count, not the table row count, 
> the math does not work out.
> The solution is to clarify the code to separate the idea of scan count vs. 
> base table row count.
> h4. Selectivity Confusion
> Similarly, each node computes its selectivity. However, the selectivity is 
> only for those predicates that will be applied via a projection. Predicates 
> that can be applied because of partition pruning (HDFS), key range pruning 
> (HBase) and so on do not "count". While this produces accurate execution 
> estimates, it is not helpful for join planning.
> In join planning, we need to know the number of filtered rows relative to the 
> total table cardinality. This allows us to adjust HDV key cardinality in 
> order to estimate the number of rows produced by the join.
> Using the partial selectivity, or partial input cardinality (above issue) 
> causes inaccurate key cardinality adjustments and incorrect join cardinality 
> estimates.
> h4. Join Node Does not Apply Selectivity from Its Predicates
> A join node can have "additional predicates" applied after creating a join 
> row. Accurate estimation of join cardinality must include the selectivity 
> from those predicates, but is not currently done. Perhaps because such 
> predicates, in the current estimation scheme, always produce an estimated 
> selectivity of .1. This will be more important as we add more realistic 
> estimates.
> h4. Use Double, not Long for Cardinality Values
> In scan nodes, row counts can be reasonable numbers and a Java {{long}} is 
> fine. But, once one starts computing join cardinalities, values can grow 
> fast, especially for cross joins. The code currently has special checks to 
> limit products to {{Long.MAX_VALUE}}. While this solves the overflow issue, 
> it has undesirable downstream affects. First, it throws of selectivity 
> calculations since the reported cardinality is not the real cardinality. 
> Second, it requires special math calls whenever we multiply cardinalities.
> Much simper to work with a {{double}}. When values get large, the extra 
> precision from a integer value is completely lost in the noise of assumptions 
> and estimations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to