jackwener commented on issue #6790:
URL: 
https://github.com/apache/arrow-datafusion/issues/6790#issuecomment-1623576542

   I have been quite busy lately, and today a newcomer @YjyJeff in the 
community encountered the same issue and submitted a PR #6862 related to it. So 
I took some time to systematically analyze this problem and help others in the 
community better understand it.
   
   This problem is caused by #6595 directly. Although the idea of #6595 is 
right, but we cannot change so for the time being. Because it will expose the 
current design problem ( schema will be change in optimization).
   
   Let me have explain this problem:
   
   ```
   There is a optimizated rule: simplify expr:
   project: t1.a + 1 + 2
   its schema is Schema: Field [qualifier: None, name:`t1.a + 1 + 2`]
   
   after simplify expr:
   project: t1.a + 3
   
   if we use original schema: schema still is 
   Schema: Field [qualifier: None, name:`t1.a + 1 + 2`]
   if we recompute schema, its schema is Schema: Field [qualifier: None, 
name:`t1.a + 3`]
   
   so, if we recompute new schema, it will cause `mismatch schema error`
   of occurse, we can add alias, it will make field keep same after 
recomputation.
   
   BUT, alias can't avoid this corner case 
https://github.com/apache/arrow-datafusion/pull/6595#discussion_r1225815262:
   alias('t1.a'), field is [qualifier: none, name: t1.a], we hope field is 
[qualifier: t1, name: a].
   I try to use #6681 #6826 to resolve it, but it will cause more complexity, I 
think it's a good idea.
   
   so I think a better temporary method is to revert #6595
   ```
   
   ---
   
   Let me talk about more.
   
   In my opinion, it is unnecessary to keep `Schema` unchanged at the 
Optimization stage, because in the process of optimization, it indeed may 
change.
   
   - qualifier/name change
   - nullable change
   - type change #6862
   
   So, keep `Schema` isn't a good design, it will cause some trouble.
   
   ---
   
   Why we need to keep `Schema` unchanged?
   
   Based on my guess, the reason may be that we need to ensure the `references` 
are correct.
   Plan is based on `Schema`. if we change schema of child plan, plan may can't 
get right column from child plan.
   
   This involves a design problem. we shouldn't use `name` as a `references` to 
get a column from child plan, we should use `id` to get a column from child 
plan.
   Because we should use a immutable/unquie item as a `references`.
   If we use this design, we also can avoid problem like #6543.
   
   Many systems in the industry have adopted this design, like spark, impala, 
doris...
   
   such as spark
   ```scala
   case class AttributeReference(
       name: String,
       datatype: DataType,
       nullable: Boolean = true,
       override val metadata: Metadata = Metadata.empty)(
       val exprId: ExprId = NamedExpression.newExprId,
       val qualifier: Seq[String] = Seq.empty[String])
     extends Attribute with Unevaluable
   
     /**
      * Returns true iff the expression id is the same for both attributes.
      */
     def sameRef(other: AttributeReference): Boolean = this.exprId == 
other.exprId
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to