[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40362:
------------------------------------

    Assignee: Apache Spark

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-40362
>                 URL: https://issues.apache.org/jira/browse/SPARK-40362
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Asif
>            Assignee: Apache Spark
>            Priority: Major
>              Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code which is now in two stages, canonicalization 
> involving Commutative operators is broken, if they are subexpressions of 
> certain type of expressions which override precanonicalize, for example 
> BinaryComparison 
> Consider following expression:
> a + b > 10
>          GT
>             |
> a + b          10
> The BinaryComparison operator in the precanonicalize, first precanonicalizes  
> children & then   may swap operands based on left /right hashCode inequality..
> lets say  Add(a + b) .hashCode is >  10.hashCode as a result GT is converted 
> to LT
> But If the same tree is created 
>            GT
>             |
>  b + a      10
> The hashCode of Add(b, a) is not same as Add(a, b) , thus it is possible that 
> for this tree
>  Add(b + a) .hashCode is <  10.hashCode  in which case GT remains as is.
> Thus to similar trees result in different canonicalization , one having GT 
> other having LT 
>  
> The problem occurs because  for commutative expressions the canonicalization 
> normalizes the expression with consistent hashCode which is not the case with 
> precanonicalize as the hashCode of commutative expression 's precanonicalize 
> and post canonicalize are different.
>  
>  
> The test 
> {quote}test("bug X")
> Unknown macro: \{     val tr1 = LocalRelation('c.int, 'b.string, 'a.int)    
> val y = tr1.where('a.attr + 'c.attr > 10).analyze    val fullCond = 
> y.asInstanceOf[Filter].condition.clone()   val addExpr = (fullCond match 
> Unknown macro}
> ).clone().asInstanceOf[Add]
> val canonicalizedFullCond = fullCond.canonicalized
> // swap the operands of add
> val newAddExpr = Add(addExpr.right, addExpr.left)
> // build a new condition which is same as the previous one, but with operands 
> of //Add reversed 
> val builtCondnCanonicalized = GreaterThan(newAddExpr, 
> Literal(10)).canonicalized
> assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for the commutative expressions, the 
> precanonicalize should be overridden and  
> Canonicalize.reorderCommutativeOperators be invoked on the expression instead 
> of at place of canonicalize. effectively for commutative operands ( add, or , 
> multiply , and etc) canonicalize and precanonicalize should be same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>  
>  
> I am also trying a better fix, where by the idea is that for commutative 
> expressions the murmur hashCode are caluculated using unorderedHash so that 
> it is order  independent ( i.e symmetric).
> The above approach works fine , but in case of Least & Greatest, the 
> Product's element is  a Seq,  and that messes with consistency of hashCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to