[jira] [Comment Edited] (CALCITE-5842) LogicalProject deepHashCode creates same value with different RowType

Yu Tian (Jira) Thu, 13 Jul 2023 14:09:04 -0700


    [ https://issues.apache.org/jira/browse/CALCITE-5842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742954#comment-17742954 ]


Yu Tian edited comment on CALCITE-5842 at 7/13/23 9:08 PM:
-----------------------------------------------------------

Hi [~Chunwei Lei] 

Thanks for your reply. Yes, what you said is correct. The reason I ask is 
that, on our side, we use a LogicalProject to rename the fields in each 
transformation.

Besides, this is more of a split use case: there is a single LogicalTableScan, 
and we split it out into different flows.
{code:java}
LogicalProject(RESULT_2.MOCK_DATA_JSON.name=[$0], RESULT_2.MOCK_DATA_JSON.location=[$1], RESULT_2.MOCK_DATA_JSON.satellites=[$2], RESULT_2.MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {14.0 rows, 24.0 cpu, 0.0 io}, id = 1815
  LogicalFilter(condition=[>(LENGTH($1), 0)]): rowcount = 1.0, cumulative cost = {13.0 rows, 20.0 cpu, 0.0 io}, id = 1814
    LogicalProject(FILTER_2.MOCK_DATA_JSON.name=[$0], FILTER_2.MOCK_DATA_JSON.location=[$1], FILTER_2.MOCK_DATA_JSON.satellites=[$2], FILTER_2.MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {12.0 rows, 19.0 cpu, 0.0 io}, id = 1812
      LogicalProject(MOCK_DATA_JSON.name=[$0], MOCK_DATA_JSON.location=[$1], MOCK_DATA_JSON.satellites=[$2], MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {11.0 rows, 15.0 cpu, 0.0 io}, id = 1811
        LogicalTableScan(table=[[test-data, FILE_ENTITY:mock_data.json]]): rowcount = 1.0, cumulative cost = {10.0 rows, 11.0 cpu, 0.0 io}, id = 1810

LogicalProject(RESULT_1.MOCK_DATA_JSON.name=[$0], RESULT_1.MOCK_DATA_JSON.location=[$1], RESULT_1.MOCK_DATA_JSON.satellites=[$2], RESULT_1.MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {14.0 rows, 24.0 cpu, 0.0 io}, id = 1821
  LogicalFilter(condition=[>(LENGTH($0), 0)]): rowcount = 1.0, cumulative cost = {13.0 rows, 20.0 cpu, 0.0 io}, id = 1820
    LogicalProject(FILTER_1.MOCK_DATA_JSON.name=[$0], FILTER_1.MOCK_DATA_JSON.location=[$1], FILTER_1.MOCK_DATA_JSON.satellites=[$2], FILTER_1.MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {12.0 rows, 19.0 cpu, 0.0 io}, id = 1818
      LogicalProject(MOCK_DATA_JSON.name=[$0], MOCK_DATA_JSON.location=[$1], MOCK_DATA_JSON.satellites=[$2], MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {11.0 rows, 15.0 cpu, 0.0 io}, id = 1811
        LogicalTableScan(table=[[test-data, FILE_ENTITY:mock_data.json]]): rowcount = 1.0, cumulative cost = {10.0 rows, 11.0 cpu, 0.0 io}, id = 1810
{code}
 

If we pass the above 2 RelNodes to the VolcanoPlanner, 
{code:java}
LogicalProject(MOCK_DATA_JSON.name=[$0], MOCK_DATA_JSON.location=[$1], MOCK_DATA_JSON.satellites=[$2], MOCK_DATA_JSON.goods=[$3]): rowcount = 1.0, cumulative cost = {11.0 rows, 15.0 cpu, 0.0 io}, id = 1811
  LogicalTableScan(table=[[test-data, FILE_ENTITY:mock_data.json]]): rowcount = 1.0, cumulative cost = {10.0 rows, 11.0 cpu, 0.0 io}, id = 1810
{code}
This part will be cached in mapDigestToRel. After the planner phase, the 
LogicalProject nodes from FILTER_1 and FILTER_2 are mixed together: the 
planner replaces FILTER_1 with FILTER_2 (since FILTER_2 is processed first and 
cached).

On our side, we are thinking of putting an indicator in the hint structure. 
Since hints are already an input to the hashCode calculation, using them to 
differentiate FILTER_1 from FILTER_2 is an alternative solution for us for 
now.
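
To illustrate the caching and the hint idea, here is a minimal sketch in plain 
Java. It is a toy model, not Calcite code: ToyProject, rename, and register are 
hypothetical names, and it assumes a digest that ignores field names but 
includes hints.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Toy model of a digest cache. None of these types are Calcite classes. */
public class DigestCacheSketch {

  /** Hypothetical stand-in for a renaming project node. */
  static final class ToyProject {
    final List<String> exps;       // projected expressions, e.g. $0..$3
    final List<String> fieldNames; // output names, e.g. FILTER_1.name
    final List<String> hints;      // hint tags attached to the node

    ToyProject(List<String> exps, List<String> fieldNames, List<String> hints) {
      this.exps = exps;
      this.fieldNames = fieldNames;
      this.hints = hints;
    }

    // Assumption: the digest ignores field names but includes hints.
    @Override public boolean equals(Object o) {
      if (!(o instanceof ToyProject)) {
        return false;
      }
      ToyProject p = (ToyProject) o;
      return exps.equals(p.exps) && hints.equals(p.hints);
    }

    @Override public int hashCode() {
      return Objects.hash(exps, hints);
    }
  }

  /** Builds a renaming project whose field names carry the given prefix. */
  static ToyProject rename(String prefix, String... hints) {
    List<String> exps = Arrays.asList("$0", "$1", "$2", "$3");
    List<String> names = Arrays.asList(prefix + ".name", prefix + ".location",
        prefix + ".satellites", prefix + ".goods");
    return new ToyProject(exps, names, Arrays.asList(hints));
  }

  /** Returns the cached equivalent of the node if any, else caches the node. */
  static ToyProject register(Map<ToyProject, ToyProject> cache, ToyProject node) {
    ToyProject existing = cache.putIfAbsent(node, node);
    return existing != null ? existing : node;
  }

  public static void main(String[] args) {
    Map<ToyProject, ToyProject> cache = new HashMap<>();
    ToyProject filter2 = register(cache, rename("FILTER_2"));
    ToyProject filter1 = register(cache, rename("FILTER_1"));
    // Without a tag, FILTER_1 resolves to the cached FILTER_2 node.
    System.out.println(filter1 == filter2);        // true
    System.out.println(filter1.fieldNames.get(0)); // FILTER_2.name

    ToyProject tagged2 = register(cache, rename("FILTER_2", "flow=2"));
    ToyProject tagged1 = register(cache, rename("FILTER_1", "flow=1"));
    // With distinct hint tags, each flow keeps its own node.
    System.out.println(tagged1 == tagged2);        // false
    System.out.println(tagged1.fieldNames.get(0)); // FILTER_1.name
  }
}
```

In this model the tag only has to differ between the flows; what a real hint 
would carry, and whether later rules preserve it, is outside the sketch.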



> LogicalProject deepHashCode creates same value with different RowType 
> ----------------------------------------------------------------------
>
>                 Key: CALCITE-5842
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5842
>             Project: Calcite
>          Issue Type: Bug
>    Affects Versions: 1.32.0
>            Reporter: Yu Tian
>            Priority: Major
>
> The LogicalProject class has deepEquals0 and deepHashCode0 methods. The 
> deepEquals0 method considers getRowType() as one of its equality criteria; 
> however, deepHashCode0 does not include getRowType() when generating the hash 
> value. Is this on purpose, or is it a bug?
> [https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/core/Project.java#L348,L368]
>  
>  
> The reason we ask is that we are trying 2 use cases on our side.
> The first one is two LogicalTableScans with similar configurations, which are 
> connected to 2 separate LogicalFilters; we then LogicalJoin the 2 together. 
> One issue we noticed is that HepPlanner has the logic below:
>  
> class org.apache.calcite.plan.hep.HepPlanner
> {code:java}
> // try to find equivalent rel only if DAG is allowed
> if (!noDag) {
>   // Now, check if an equivalent vertex already exists in graph.
>   HepRelVertex equivVertex = mapDigestToVertex.get(rel.getRelDigest());
>   if (equivVertex != null) {
>     // Use existing vertex.
>     return equivVertex;
>   }
> } {code}
>  
> The 2 LogicalProjects from the 2 LogicalTableScans have the same hashCode 
> value under the deepHashCode method in LogicalProject, because it does not 
> consider the getRowType() value. The planner replaces LogicalTableScan2 
> with LogicalTableScan1, when in fact we should treat them as separate items 
> to process. 
>  
> Another use case we have involves 2 diagrams, each with a LogicalTableScan, 
> LogicalFilter, and LogicalTableModify. The LogicalTableScans have a similar 
> setup but different rowType information. This time HepPlanner passes, since 
> each diagram has a separate HepPlanner stage, so the above issue does not 
> happen. However, when it reaches the VolcanoPlanner, the logic is:
>  
> class org.apache.calcite.plan.volcano.VolcanoPlanner
> {code:java}
> // If it is equivalent to an existing expression, return the set that
> // the equivalent expression belongs to.
> RelDigest digest = rel.getRelDigest();
> RelNode equivExp = mapDigestToRel.get(digest); {code}
>  
> The map replaces LogicalTableScan1 with LogicalTableScan2 in the 
> LogicalProject stage since they have the same hashCode, and the map reuses 
> the earlier processed RelNode, which causes the issue.
>  
> Here are the proposals we have:
>  
>  * Narrow-scope change: LogicalProject is the most frequently used project 
> type, so we change only it.
>  ** Modify the LogicalProject deepHashCode method to use 
> {code:java}
> @Override public int deepHashCode() {
>   return Objects.hash(traitSet, input.deepHashCode(), exps, hints, getRowType());
> }{code}
> Including the getRowType() value in the hash generation will resolve the 
> issue, since the rowType contains the field names and data type information. 
>  
>  * Whole-scope change: change the deepHashCode method in the Project class.
>  ** Similar to the change above; however, its scope is wider than that of 
> the first one.
>  
> Is this something we could consider improving in a following release of 
> Apache Calcite?
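
One general Java point may help when weighing the proposals in the description: 
a HashMap treats two keys as the same entry only when their hash codes match 
and equals also returns true. The sketch below uses a hypothetical Key class 
(not Calcite's RelDigest or Project) whose hashCode omits a field that equals 
considers, mirroring the deepEquals0/deepHashCode0 asymmetry described there.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy key type for illustration; not a Calcite class. */
public class HashVersusEqualsSketch {

  static final class Key {
    final String exps;
    final String rowType;

    Key(String exps, String rowType) {
      this.exps = exps;
      this.rowType = rowType;
    }

    // Equality considers both fields.
    @Override public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key k = (Key) o;
      return exps.equals(k.exps) && rowType.equals(k.rowType);
    }

    // The hash deliberately omits rowType, so keys that differ only in
    // rowType land in the same bucket.
    @Override public int hashCode() {
      return exps.hashCode();
    }
  }

  /** Counts how many distinct entries a HashMap keeps for the given keys. */
  static int distinctEntries(Key... keys) {
    Map<Key, Boolean> map = new HashMap<>();
    for (Key key : keys) {
      map.put(key, Boolean.TRUE);
    }
    return map.size();
  }

  public static void main(String[] args) {
    Key a = new Key("[$0, $1]", "(name, location)");
    Key b = new Key("[$0, $1]", "(id, city)");
    System.out.println(a.hashCode() == b.hashCode()); // true
    System.out.println(distinctEntries(a, b));        // 2
  }
}
```

In this toy case the colliding keys stay separate because equals tells them 
apart. Whether the same holds for the planner maps depends on how the digest's 
equality compares row types, which the sketch does not model.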



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
