[
https://issues.apache.org/jira/browse/CALCITE-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096985#comment-17096985
]
Haisheng Yuan commented on CALCITE-3963:
----------------------------------------
As long as all the alternatives in a RelSet share the same logical properties,
we don't care where the logical properties are stored.
I am afraid the 'fold' operator will make things complicated. What about
cardinality and selectivity? We may just end up choosing one blindly. It
doesn't seem right to use alternative 1's cardinality info, alternative 2's
selectivity info, and all the alternatives' unique keys ...
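To make the concern concrete, here is a minimal sketch, assuming the 'fold'
is a per-property minimum over the alternatives. FoldSketch, AltStats and the
numbers are hypothetical and are not Calcite classes or real estimates. It only
shows that a per-property fold can take each property from a different
alternative.

```java
import java.util.Arrays;
import java.util.List;

final class FoldSketch {
  /** Hypothetical per-alternative estimates; not a Calcite class. */
  static final class AltStats {
    final String name;
    final double rowCount;
    final double selectivity;

    AltStats(String name, double rowCount, double selectivity) {
      this.name = name;
      this.rowCount = rowCount;
      this.selectivity = selectivity;
    }
  }

  /** Returns the name of the alternative with the smallest row count. */
  static String rowCountSource(List<AltStats> alts) {
    AltStats best = alts.get(0);
    for (AltStats a : alts) {
      if (a.rowCount < best.rowCount) {
        best = a;
      }
    }
    return best.name;
  }

  /** Returns the name of the alternative with the smallest selectivity. */
  static String selectivitySource(List<AltStats> alts) {
    AltStats best = alts.get(0);
    for (AltStats a : alts) {
      if (a.selectivity < best.selectivity) {
        best = a;
      }
    }
    return best.name;
  }

  public static void main(String[] args) {
    // Made-up numbers, for illustration only.
    List<AltStats> alts = Arrays.asList(
        new AltStats("alt1", 1000.0, 0.50),
        new AltStats("alt2", 5000.0, 0.10));
    System.out.println("row count taken from " + rowCountSource(alts));
    System.out.println("selectivity taken from " + selectivitySource(alts));
  }
}
```

With these inputs the folded row count comes from alt1 and the folded
selectivity comes from alt2, so the group's properties describe no single
alternative.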
Admittedly, each alternative's stats may vary a lot. One of the reasons is that
Calcite believes all the simplification should be done in VolcanoPlanner and
selected based on cost, while other systems like SQL Server and Greenplum do
all the simplification, like constant folding, join simplification, and
predicate push-down, before the logical plan goes into the MEMO.
One of the reasons to share logical properties between alternatives in a group
is that it becomes possible (in the future) to make an early decision to stop
exploring this group. If we use the 'fold' operator to decide the group's
logical properties, when is a good time to decide?
Option 1: whenever there is a new alternative, recompute the logical
properties. That may be no better than just storing logical properties for
each RelNode.
Option 2: roll it up after all the logical alternatives are generated. But
there is no logical / physical distinction, so we don't know whether an
operator is logical or not. Judging by convention is not perfect, because
systems like Flink, Drill, and Ignite define their own logical conventions.
There is no distinction between logical rules and physical rules either; they
are matched and applied at the same stage. Physical rules can even generate
logical operators, like ProjectMergeRule; will these generated logical
operators be counted?
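Option 1 can be sketched as follows, assuming the fold is a running minimum
of a row-count estimate. RefoldingGroup and its estimator are hypothetical
names, not part of Calcite. The counter shows that the estimator still runs
once per alternative, which is the same amount of work as keeping the
properties on each RelNode.

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical group that refolds its row count on every new alternative. */
final class RefoldingGroup<N> {
  private final ToDoubleFunction<N> estimator;
  private double rowCount = Double.POSITIVE_INFINITY;
  private int estimatorCalls = 0;

  RefoldingGroup(ToDoubleFunction<N> estimator) {
    this.estimator = estimator;
  }

  /** Registers an alternative and folds its estimate into the group value. */
  void add(N alternative) {
    estimatorCalls++;
    rowCount = Math.min(rowCount, estimator.applyAsDouble(alternative));
  }

  double rowCount() {
    return rowCount;
  }

  int estimatorCalls() {
    return estimatorCalls;
  }

  public static void main(String[] args) {
    // Stand-in estimator: equivalent alternatives report the same row count.
    RefoldingGroup<String> group = new RefoldingGroup<>(alt -> 100.0);
    group.add("LogicalAggregate");
    group.add("HashAggregate[dist=a]");
    group.add("HashAggregate[dist=a,b]");
    System.out.println("row count = " + group.rowCount());
    System.out.println("estimator calls = " + group.estimatorCalls());
  }
}
```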
Another reason to share logical properties is to avoid redundant computation.
For example,
{code:java}
SELECT a,b,c,max(d) FROM foo GROUP BY a,b,c;
HashAggregate
+-- TableScan
{code}
In a distributed system, suppose we generate HashAgg with distribution
alternatives for all 8 key combinations. In SQL Server, there is only 1
physical operator, HashAgg, but in Calcite there are 8 HashAgg operators: the
same HashAgg with different trait sets. We will get another 8 exchange
operators (in Calcite 1.22 and before, there were more than 50 exchange
operators). We need to compute the logical properties for all the HashAgg and
Exchange operators, and even though the result is cached in the metadata
system, these operators are just wasting effort on properties that the
LogicalAggregate operator already provides.
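As a rough illustration of the counts above, here is a small sketch. It
assumes that "8 key combinations" means every subset of the group keys
{a, b, c}, including the empty set, and that all the HashAgg and exchange
operators end up in one equivalent group. DistributionKeySketch is a
hypothetical name, not a Calcite class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DistributionKeySketch {
  /** Enumerates every subset of the given keys, including the empty one. */
  static List<List<String>> keySubsets(List<String> keys) {
    List<List<String>> subsets = new ArrayList<>();
    int n = keys.size();
    for (int mask = 0; mask < (1 << n); mask++) {
      List<String> subset = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        if ((mask & (1 << i)) != 0) {
          subset.add(keys.get(i));
        }
      }
      subsets.add(subset);
    }
    return subsets;
  }

  public static void main(String[] args) {
    List<List<String>> subsets = keySubsets(Arrays.asList("a", "b", "c"));
    int hashAggs = subsets.size();   // one HashAgg per distribution key set
    int exchanges = subsets.size();  // one exchange per distribution key set
    System.out.println("distribution key sets: " + subsets);
    System.out.println("per-node property derivations: " + (hashAggs + exchanges));
    System.out.println("per-group property derivations: 1");
  }
}
```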
> Maintains logical properties at RelSet (equivalent group) instead of RelNode
> ----------------------------------------------------------------------------
>
> Key: CALCITE-3963
> URL: https://issues.apache.org/jira/browse/CALCITE-3963
> Project: Calcite
> Issue Type: Bug
> Reporter: Xiening Dai
> Assignee: Xiening Dai
> Priority: Major
>
> Currently the logical properties (such as row count, distinct row count, etc.)
> are maintained at RelNode level. This creates a number of metadata
> consistency problems, e.g. CALCITE-1048, CALCITE-2166.
> In theory, all RelNodes in a RelSet should share the same logical properties
> by definition of relational equivalence. So it makes more sense to keep
> logical properties at RelSet level, rather than the RelNode. And such
> properties shouldn't change when a new subset is created or a subset's best
> is changed.
> Specifically, I think the built-in metadata below should fall into the
> logical properties category -
> Selectivity
> UniqueKeys
> ColumnUniqueness
> RowCount
> MaxRowCount
> MinRowCount
> DistinctRowCount
> Size (averageRowSize, averageColumnSize)
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)