[I] Optimize projection naming of GenerateRel to Velox Unnest conversion [incubator-gluten]

via GitHub Tue, 21 May 2024 20:55:27 -0700


xumingming opened a new issue, #5837:
URL: https://github.com/apache/incubator-gluten/issues/5837


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   While developing PR https://github.com/apache/incubator-gluten/pull/5782, 
I found a bug/weakness in the conversion of a Substrait GenerateRel to a Velox 
plan.
   
   The test case:
   ```
     test("test inline function1") {
       // CreateArray: func(array(col))
       withTempView("script_trans") {
         sql("""SELECT * FROM VALUES
               |(1, 2, 3),
               |(4, 5, 6),
               |(7, 8, 9)
               |AS script_trans(a, b, c)
            """.stripMargin).createOrReplaceTempView("script_trans")
         runQueryAndCompare(s"""SELECT TRANSFORM(b, MAX(a), CAST(SUM(c) AS STRING), myCol, myCol2)
                               |  USING 'cat' AS (a STRING, b STRING, c STRING, d ARRAY<INT>, e STRING)
                               |FROM script_trans
                               |LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol
                               |LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2
                               |WHERE a <= 4
                               |GROUP BY b, myCol, myCol2
                               |HAVING max(a) > 1""".stripMargin) {
           checkGlutenOperatorMatch[GenerateExecTransformer]
         }
       }
     }
   ```
   
   The key point is that the query contains multiple consecutive generator 
functions. The converted Velox plan is:
   
   
   ```
   -- Unnest[6][n5_8] -> n5_4:INTEGER, n5_5:INTEGER, n5_6:INTEGER, n5_7:ARRAY<INTEGER>, C0:INTEGER
     -- Project[5][expressions: (n5_4:INTEGER, "n2_3"), (n5_5:INTEGER, "n2_4"), (n5_6:INTEGER, "n2_5"), (n5_7:ARRAY<INTEGER>, "C0"), (n5_8:ARRAY<INTEGER>, "C0")] -> n5_4:INTEGER, n5_5:INTEGER, n5_6:INTEGER, n5_7:ARRAY<INTEGER>, n5_8:ARRAY<INTEGER>
       -- Filter[4][expression: greaterthan(size("C0"),0)] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
         -- Unnest[3][n2_6] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
           -- Project[2][expressions: (n2_3:INTEGER, "n0_0"), (n2_4:INTEGER, "n0_1"), (n2_5:INTEGER, "n0_2"), (n2_6:ARRAY<ARRAY<INTEGER>>, 1 elements starting at 0 {3 elements starting at 0 {1, 2, 3}})] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, n2_6:ARRAY<ARRAY<INTEGER>>
             -- Filter[1][expression: lessthanorequal("n0_0",4)] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
               -- ValueStream[0][] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
   ```
   
   Note that the first Project (line 2 of the plan) does two things:
   
   1. It renames columns, e.g. n2_3 -> n5_4, without changing their values.
   2. It projects an unnecessary column, n5_8, which is identical to n5_7.
   
   So it effectively does nothing, and we should be able to remove it safely. 
But if we actually remove it, the resulting plan has two output columns with 
the same name but different types:
   
   ```
   -- Unnest[5][C0] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>, C0:INTEGER
     -- Filter[4][expression: greaterthan(size("C0"),0)] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
       -- Unnest[3][n2_6] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
         -- Project[2][expressions: (n2_3:INTEGER, "n0_0"), (n2_4:INTEGER, "n0_1"), (n2_5:INTEGER, "n0_2"), (n2_6:ARRAY<ARRAY<INTEGER>>, 1 elements starting at 0 {3 elements starting at 0 {1, 2, 3}})] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, n2_6:ARRAY<ARRAY<INTEGER>>
           -- Filter[1][expression: lessthanorequal("n0_0",4)] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
             -- ValueStream[0][] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
   ```
   
   Note the two `C0` columns in the first line. I'd like to fix this projection 
column naming issue first in a dedicated PR. What do you think, @marin-ma 
@zhouyuan?
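
To make the naming problem concrete, here is a rough sketch of how a duplicated output name could be detected and how repeated names could be made unique. It is only an illustration under stated assumptions: `Column`, `OutputNames`, `duplicates` and `makeUnique` are hypothetical names invented for this example, they are not part of Gluten or Velox, and the suffix scheme is just one possible choice rather than the proposed fix.

```scala
// Hypothetical model of one output column of a plan node.
// This is NOT a Gluten or Velox type; it exists only for this sketch.
final case class Column(name: String, dataType: String)

object OutputNames {
  /** Names that occur more than once in an output schema, in first-seen order. */
  def duplicates(output: Seq[Column]): Seq[String] =
    output.map(_.name).distinct.filter(n => output.count(_.name == n) > 1)

  /** Keeps the first occurrence of each name and renames later occurrences
    * by appending a numeric suffix that has not been emitted yet.
    * Types and column order are left unchanged. */
  def makeUnique(output: Seq[Column]): Seq[Column] = {
    val used = scala.collection.mutable.Set.empty[String]
    output.map { col =>
      var candidate = col.name
      var i = 1
      while (used.contains(candidate)) {
        candidate = s"${col.name}_$i"
        i += 1
      }
      used += candidate
      col.copy(name = candidate)
    }
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Output of Unnest[5] from the plan above, after the Project is removed.
    val unnestOutput = Seq(
      Column("n2_3", "INTEGER"),
      Column("n2_4", "INTEGER"),
      Column("n2_5", "INTEGER"),
      Column("C0", "ARRAY<INTEGER>"),
      Column("C0", "INTEGER"))

    // Reports the single repeated name, C0.
    println(OutputNames.duplicates(unnestOutput))
    // The second C0 (the INTEGER one) gets a suffixed name.
    println(OutputNames.makeUnique(unnestOutput).map(_.name))
  }
}
```

Applied to the output of `Unnest[5]`, the check flags `C0`, and the renaming keeps `C0:ARRAY<INTEGER>` while giving the unnested `INTEGER` column a distinct name. Whether names should be fixed after the fact like this, or generated uniquely during the conversion, is a design question for the dedicated PR.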
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

