xumingming opened a new issue, #5837: URL: https://github.com/apache/incubator-gluten/issues/5837
### Backend

VL (Velox)

### Bug description

While developing PR https://github.com/apache/incubator-gluten/pull/5782, a bug/weakness surfaced in the conversion of a Substrait GenerateRel to a Velox plan. The test case:

```
test("test inline function1") {
  // CreateArray: func(array(col))
  withTempView("script_trans") {
    sql("""SELECT * FROM VALUES
          |(1, 2, 3),
          |(4, 5, 6),
          |(7, 8, 9)
          |AS script_trans(a, b, c)
        """.stripMargin).createOrReplaceTempView("script_trans")
    runQueryAndCompare(s"""SELECT TRANSFORM(b, MAX(a), CAST(SUM(c) AS STRING), myCol, myCol2)
                          |  USING 'cat' AS (a STRING, b STRING, c STRING, d ARRAY<INT>, e STRING)
                          |FROM script_trans
                          |LATERAL VIEW explode(array(array(1,2,3))) myTable AS myCol
                          |LATERAL VIEW explode(myTable.myCol) myTable2 AS myCol2
                          |WHERE a <= 4
                          |GROUP BY b, myCol, myCol2
                          |HAVING max(a) > 1""".stripMargin) {
      checkGlutenOperatorMatch[GenerateExecTransformer]
    }
  }
}
```

The key point is that the query contains multiple consecutive generator functions. The converted Velox plan is:

```
-- Unnest[6][n5_8] -> n5_4:INTEGER, n5_5:INTEGER, n5_6:INTEGER, n5_7:ARRAY<INTEGER>, C0:INTEGER
  -- Project[5][expressions: (n5_4:INTEGER, "n2_3"), (n5_5:INTEGER, "n2_4"), (n5_6:INTEGER, "n2_5"), (n5_7:ARRAY<INTEGER>, "C0"), (n5_8:ARRAY<INTEGER>, "C0")] -> n5_4:INTEGER, n5_5:INTEGER, n5_6:INTEGER, n5_7:ARRAY<INTEGER>, n5_8:ARRAY<INTEGER>
    -- Filter[4][expression: greaterthan(size("C0"),0)] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
      -- Unnest[3][n2_6] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
        -- Project[2][expressions: (n2_3:INTEGER, "n0_0"), (n2_4:INTEGER, "n0_1"), (n2_5:INTEGER, "n0_2"), (n2_6:ARRAY<ARRAY<INTEGER>>, 1 elements starting at 0 {3 elements starting at 0 {1, 2, 3}})] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, n2_6:ARRAY<ARRAY<INTEGER>>
          -- Filter[1][expression: lessthanorequal("n0_0",4)] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
            -- ValueStream[0][] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
```

Note that the first Project (line 2 of the plan, `Project[5]`) does two things:

1. It renames columns, e.g. `n2_3` -> `n5_4`, without changing their values.
2. It projects an unnecessary column, `n5_8`, which is the same as `n5_7`.

So it effectively does nothing, and we should be able to remove it safely. But if we actually remove it, the resulting plan has duplicated projection column names with different types:

```
-- Unnest[5][C0] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>, C0:INTEGER
  -- Filter[4][expression: greaterthan(size("C0"),0)] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
    -- Unnest[3][n2_6] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, C0:ARRAY<INTEGER>
      -- Project[2][expressions: (n2_3:INTEGER, "n0_0"), (n2_4:INTEGER, "n0_1"), (n2_5:INTEGER, "n0_2"), (n2_6:ARRAY<ARRAY<INTEGER>>, 1 elements starting at 0 {3 elements starting at 0 {1, 2, 3}})] -> n2_3:INTEGER, n2_4:INTEGER, n2_5:INTEGER, n2_6:ARRAY<ARRAY<INTEGER>>
        -- Filter[1][expression: lessthanorequal("n0_0",4)] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
          -- ValueStream[0][] -> n0_0:INTEGER, n0_1:INTEGER, n0_2:INTEGER
```

Note the two `C0` columns in the first line (`C0:ARRAY<INTEGER>` and `C0:INTEGER`). I'd like to fix this projection column naming issue first in a dedicated PR. What do you think, @marin-ma @zhouyuan?

### Spark version

None

### Spark configurations

_No response_

### System information

_No response_

### Relevant logs

_No response_
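
To make the naming collision from the bug description concrete, here is a minimal Scala sketch of how duplicated output column names could be detected and made unique. `ColumnNameDedup`, `duplicates`, and `uniquify` are hypothetical helpers invented for illustration; they are not part of Gluten or Velox, and the `_N` suffix scheme is an assumption, not the naming rule the actual fix would use.

```scala
// Hypothetical helper, for illustration only (not Gluten or Velox API).
object ColumnNameDedup {

  /** Names that occur more than once, in first-seen order. */
  def duplicates(names: Seq[String]): Seq[String] =
    names.distinct.filter(n => names.count(_ == n) > 1)

  /**
   * Keeps the first occurrence of every name and appends a numeric
   * suffix (assumed scheme: "_1", "_2", ...) to later occurrences
   * until the name no longer clashes with one already emitted.
   */
  def uniquify(names: Seq[String]): Seq[String] = {
    val used = scala.collection.mutable.Set[String]()
    names.map { name =>
      var candidate = name
      var suffix = 0
      while (used.contains(candidate)) {
        suffix += 1
        candidate = s"${name}_$suffix"
      }
      used += candidate
      candidate
    }
  }
}
```

Applied to the output names of `Unnest[5]` in the second plan, `duplicates` reports `C0`, and `uniquify` leaves the replicated `C0:ARRAY<INTEGER>` column untouched while giving the unnested `C0:INTEGER` column a distinct name.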
