parthchandra commented on code in PR #577: URL: https://github.com/apache/datafusion-comet/pull/577#discussion_r1643449854
########## spark/inspections/CometTPCHQueriesList-results.txt: ########## @@ -1,133 +1,133 @@ -Query: q1 TPCH Snappy. Comet Exec: Enabled (CometHashAggregate, CometProject) +Query: q1 TPCH Snappy. Comet Exec: Enabled (CometHashAggregate, CometFilter, CometProject) Query: q1 TPCH Snappy: ExplainInfo: Comet shuffle is not enabled: spark.sql.adaptive.coalescePartitions.enabled is enabled and spark.comet.shuffle.enforceMode.enabled is not enabled Query: q2 TPCH Snappy. Comet Exec: Enabled (CometFilter, CometProject) Query: q2 TPCH Snappy: ExplainInfo: -BroadcastExchange is not supported +BroadcastHashJoin is not enabled because not all child plans are native Review Comment: TL;DR To get the reason why a specific operation in a specific query is not native we must look at the extended explain info in conjunction with the plan. One of the goals of the extended explain plan output (perhaps the primary goal) was to run this over a large number of queries and aggregate which operations end users are likely to benefit from. For instance, `xxhash64` appears to be an expression which prevents several TPC-H queries from running natively, so we were able to prioritise the work on it. Getting detailed information for specific queries was not originally a goal, so we subsume some of the extended explain information, preferring to only output root causes. This is also the reason why we do not associate the information with any specific node. In TPC-H q2, there are _seven_ BroadcastHashJoins in the plan. None of them are converted to native because the child plan is not native. Some of the nodes have one arm which is native but the other is a BroadcastExchange, other nodes have a Project as one arm and a BroadcastExchange as the other. Now the Project arm in the node is not native because its child is not native. But there are different reasons for why the Project has a child plan that is not native. In this case, some have a child which is a BroadcastHashJoin and others have a SortMergeJoin. And so on. The only good way to make this something that end users can leverage is to print the entire plan with the explanation on the same line as the node. That has two disadvantages. Firstly, we can no longer aggregate this information because each extended explain plan is unique to the query. Secondly, the extended explain plan becomes verbose and hard to read. I've made an attempt that makes it a little more verbose without getting too query specific. For TPC-H q2 it looks like this - ``` BroadcastHashJoin is not enabled because the following children are not native (BroadcastExchange) BroadcastHashJoin is not enabled because the following children are not native (Project, BroadcastExchange) Comet shuffle is not enabled: spark.sql.adaptive.coalescePartitions.enabled is enabled and spark.comet.shuffle.enforceMode.enabled is not enabled Filter is not native because the following children are not native (HashAggregate) HashAggregate is not native because the following children are not native (Project) ObjectHashAggregate is not supported Project is not native because the following children are not native (BroadcastHashJoin) Project is not native because the following children are not native (SortMergeJoin) Sort is not native because the following children are not native (Exchange) SortMergeJoin is not enabled because the following children are not native (Sort, Sort) TakeOrderedAndProject requires shuffle to be enabled xxhash64 is disabled by default. Set spark.comet.xxhash64.enabled=true to enable it. ``` WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
