bmorck opened a new issue, #1092:
URL: https://github.com/apache/datafusion-comet/issues/1092
### Describe the bug
I'm working on some internal benchmarks using Comet with Spark 3.3 and
Iceberg. To support iceberg, we are including the config, which inserts the
`CometSparkToColumnar` nodes following the BatchScans
> "spark.comet.convert.parquet.enabled" = "true"
"spark.comet.sparkToColumnar.supportedOperatorList" = "BatchScan"
In the following example, we see that there is a missing `ColumnarToRow`
preceding the `Project (19)` operator. This results in the query failing. After
analyzing the query optimization, I've found that the
`EliminateRedundantTransitions` rule, removed a `CometSparkToColumnar` and
subsequent `ColumnarToRow` following the `BatchScan (20)` operator, due to the
subsequent `Filter (21)` operator requiring row format.
` CometSort (27)
+- CometColumnarExchange (26)
+- BroadcastNestedLoopJoin Cross BuildRight (25)
:- Project (19)
: +- CometSortMergeJoin (18)
: :- CometSort (6)
: : +- ShuffleQueryStage (5)
: : +- CometExchange (4)
: : +- CometFilter (3)
: : +- CometSparkToColumnar (2)
: : +- BatchScan (1)
: +- CometSort (17)
: +- CometExchange (16)
: +- CometFilter (15)
: +- CometHashAggregate (14)
: +- ShuffleQueryStage (13)
: +- CometExchange (12)
: +- CometHashAggregate (11)
: +- CometProject (10)
: +- CometFilter (9)
: +- CometSparkToColumnar (8)
: +- BatchScan (7)
+- BroadcastQueryStage (24)
+- BroadcastExchange (23)
+- * Project (22)
+- * Filter (21)
+- BatchScan (20)`
I've modified the filter to be able to be converted to native and we see the
query inserts the appropriate `ColumnarToRow` transitions, as shown below. It's
unclear if this is a bug in Sparks `ApplyColumnarRulesAndInsertTransitions`
rule or if this is unique to Comet, but it seems like incorrect behavior when
using the `CometSparkToColumnarNode`
` ColumnarToRow (34)
+- CometSort (33)
+- AQEShuffleRead (32)
+- ShuffleQueryStage (31), Statistics(sizeInBytes=4.2 KiB,
rowCount=36)
+- CometColumnarExchange (30)
+- BroadcastNestedLoopJoin Cross BuildRight (29)
:- Project (21)
: +- ColumnarToRow (20)
: +- CometBroadcastHashJoin (19)
: :- BroadcastQueryStage (8),
Statistics(sizeInBytes=319.7 KiB, rowCount=583)
: : +- CometBroadcastExchange (7)
: : +- AQEShuffleRead (6)
: : +- ShuffleQueryStage (5),
Statistics(sizeInBytes=409.3 KiB, rowCount=583)
: : +- CometExchange (4)
: : +- CometFilter (3)
: : +- CometSparkToColumnar (2)
: : +- BatchScan (1)
: +- CometFilter (18)
: +- CometHashAggregate (17)
: +- AQEShuffleRead (16)
: +- ShuffleQueryStage (15),
Statistics(sizeInBytes=2040.0 B, rowCount=10)
: +- CometExchange (14)
: +- CometHashAggregate (13)
: +- CometProject (12)
: +- CometFilter (11)
: +- CometSparkToColumnar
(10)
: +- BatchScan (9)
+- BroadcastQueryStage (28), Statistics(sizeInBytes=864.0
B, rowCount=36)
+- BroadcastExchange (27)
+- ColumnarToRow (26)
+- CometProject (25)
+- CometFilter (24)
+- CometSparkToColumnar (23)
+- BatchScan (22)
`
### Steps to reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]