jiangjiangtian opened a new pull request, #10475:
URL: https://github.com/apache/incubator-gluten/pull/10475
## What changes are proposed in this pull request?
If the generator is an instance of `HiveGenericUDTF`, the `GenerateExec`
can't be offloaded, so it runs row-based between a columnar-to-row (c2r) and a
row-to-columnar (r2c) conversion. `GenerateExec` may emit several times as many
rows as it receives, which makes these conversions very expensive. Columnar
partial generate avoids converting columns that do not need to be converted and
uses Arrow columnar batches to accelerate the evaluation. In our tests, it
gives a performance gain of about 10%-20%.
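To illustrate the idea only (this is not the code in this PR), here is a minimal Python sketch in which plain lists stand in for Arrow column vectors. The names `partial_generate` and `explode_pairs` and the sample data are hypothetical. Only the generator's input columns are read row by row; every child column is then repeated column-wise according to how many rows the generator produced for each input row.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

# A "columnar batch" here is just a mapping: column name -> list of values.
Batch = Dict[str, List[object]]


def partial_generate(
    batch: Batch,
    generator_inputs: Sequence[str],
    generator: Callable[..., Iterable[Tuple[object, ...]]],
    output_names: Sequence[str],
) -> Batch:
    """Run a row-based generator while keeping the other columns columnar."""
    if not generator_inputs:
        raise ValueError("the generator needs at least one input column")

    # Row view over the generator inputs only; other columns are not converted.
    input_rows = zip(*(batch[name] for name in generator_inputs))

    generated: Batch = {name: [] for name in output_names}
    counts: List[int] = []  # number of rows generated per input row
    for row in input_rows:
        produced = 0
        for out in generator(*row):
            for name, value in zip(output_names, out):
                generated[name].append(value)
            produced += 1
        counts.append(produced)

    # Repeat each child value once per generated row, one column at a time.
    result: Batch = {
        name: [v for v, n in zip(values, counts) for _ in range(n)]
        for name, values in batch.items()
    }
    result.update(generated)
    return result


def explode_pairs(text: str) -> Iterable[Tuple[str, str]]:
    """Hypothetical stand-in for a Hive UDTF: '10:a,25:b' -> ('10','a'), ('25','b')."""
    for item in text.split(","):
        key, _, value = item.partition(":")
        yield key, value


batch = {"str1": ["5", "5"], "str2": ["10:a,25:b", "30:c"], "type": ["x", "y"]}
out = partial_generate(batch, ["str2"], explode_pairs, ["docid", "docfeat"])
print(out)
```

In this sketch `str1` and `type` are never turned into rows, which is the saving the description above refers to; the actual operator works on Arrow batches rather than Python lists.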
Here is an example plan before columnar partial generate:
```
+- ^(1) FilterExecTransformer (isnotnull(docid#8) AND (cast(docid#8 as double) > 20.0))
   +- ^(1) InputIteratorTransformer[str1#5, str2#6, type#7, docid#8, docfeat#9]
      +- RowToVeloxColumnar
         +- Generate xxx
            +- VeloxColumnarToRow
               +- FilterExecTransformer (isnotnull(str1#26) AND (str1#26 = 5))
                  +- Scan hive default.jt_test [str1#5, str2#6, type#7], HiveTableRelation [`default`.`jt_test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [str1#5, str2#6, type#7], Partition Cols: []]
```
After:
```
VeloxColumnarToRow
+- ^(3) FilterExecTransformer (isnotnull(docid#29) AND (cast(docid#29 as double) > 20.0))
   +- ^(3) InputIteratorTransformer[str1#26, str2#27, type#28, docid#29, docfeat#30]
      +- ColumnarPartialGenerate Generate xxx
         +- FilterExecTransformer (isnotnull(str1#26) AND (str1#26 = 5))
            +- Scan hive default.jt_pg_test [str1#26, str2#27, type#28], HiveTableRelation [`default`.`jt_pg_test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [str1#26, str2#27, type#28], Partition Cols: []]
```
Set `spark.gluten.sql.columnar.partial.generate` to `false` to disable
columnar partial generate.
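As one possible way to pass this setting, the sketch below builds `--conf key=value` arguments for `spark-submit` from a dictionary. The helper `spark_conf_args` is hypothetical and only formats the flags; the config key is the one named above.

```python
from typing import Dict, List


def spark_conf_args(conf: Dict[str, object]) -> List[str]:
    """Format a dict of Spark settings as spark-submit `--conf key=value` flags."""
    args: List[str] = []
    for key, value in conf.items():
        # Spark expects lowercase "true"/"false" for boolean settings.
        if isinstance(value, bool):
            value = str(value).lower()
        args += ["--conf", f"{key}={value}"]
    return args


# Disable columnar partial generate for one job.
args = spark_conf_args({"spark.gluten.sql.columnar.partial.generate": False})
print(" ".join(["spark-submit", *args]))
```

The same key can also be set in `spark-defaults.conf` or on the session builder, like any other Spark configuration entry.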