lgbo-ustc opened a new issue, #8138:
URL: https://github.com/apache/incubator-gluten/issues/8138
### Description
We have met a lot of queries which have aggregations with `distinct`. For
example
```sql
0: jdbc:hive2://localhost:10000> explain select n_regionkey,
sum(n_nationkey), count(distinct n_name) from nation group by n_regionkey;
+----------------------------------------------------+
| plan |
+----------------------------------------------------+
| == Physical Plan ==
CHNativeColumnarToRow
+- ^(6) HashAggregateTransformer(keys=[n_regionkey#2L],
functions=[sum(n_nationkey#0L), count(distinct n_name#1)], isStreamingAgg=false)
+- ^(6) InputIteratorTransformer[n_regionkey#2L, sum#35L, count#38L]
+- ColumnarExchange hashpartitioning(n_regionkey#2L, 5),
ENSURE_REQUIREMENTS, [plan_id=313], [shuffle_writer_type=hash], [OUTPUT]
ArrayBuffer(n_regionkey:LongType, sum:LongType, count:LongType)
+- ^(5) HashAggregateTransformer(keys=[n_regionkey#2L],
functions=[merge_sum(n_nationkey#0L), partial_count(distinct n_name#1)],
isStreamingAgg=false)
+- ^(5) HashAggregateTransformer(keys=[n_regionkey#2L,
n_name#1], functions=[merge_sum(n_nationkey#0L)], isStreamingAgg=false)
+- ^(5) InputIteratorTransformer[n_regionkey#2L, n_name#1,
sum#35L]
+- ColumnarExchange hashpartitioning(n_regionkey#2L,
n_name#1, 5), ENSURE_REQUIREMENTS, [plan_id=307], [shuffle_writer_type=hash],
[OUTPUT] ArrayBuffer(n_regionkey:LongType, n_name:StringType, sum:LongType)
+- ^(4) HashAggregateTransformer(keys=[n_regionkey#2L,
n_name#1], functions=[partial_sum(n_nationkey#0L)], isStreamingAgg=false)
+- ^(4) FileScanTransformer parquet
tpch_pq.nation[n_nationkey#0L,n_name#1,n_regionkey#2L] Batched: true,
DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1
paths)[file://...., PartitionFilters: [], PushedFilters: [], ReadSchema:
struct<n_nationkey:bigint,n_name:string,n_regionkey:bigint>
|
+----------------------------------------------------+
```
There are two aggregation steps here, the 1st step is to make `n_name`
unique. If `n_regionkey` or `n_name` is high cardinality, the following two
operations are expected to be high cost.
We cannot use `count_distinct` to merge the two aggregations steps into one
simply. There could be data skew which will cause OOM.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]