[I] [CH] Improve aggregation with `distinct` [incubator-gluten]

via GitHub Tue, 03 Dec 2024 20:20:54 -0800


lgbo-ustc opened a new issue, #8138:
URL: https://github.com/apache/incubator-gluten/issues/8138


   ### Description
   
   We have encountered many queries that contain aggregations with `distinct`. For example:
   ```sql
   0: jdbc:hive2://localhost:10000> explain select n_regionkey, sum(n_nationkey), count(distinct n_name) from nation group by n_regionkey;
   +----------------------------------------------------+
   |                        plan                        |
   +----------------------------------------------------+
   | == Physical Plan ==
   CHNativeColumnarToRow
   +- ^(6) HashAggregateTransformer(keys=[n_regionkey#2L], functions=[sum(n_nationkey#0L), count(distinct n_name#1)], isStreamingAgg=false)
      +- ^(6) InputIteratorTransformer[n_regionkey#2L, sum#35L, count#38L]
         +- ColumnarExchange hashpartitioning(n_regionkey#2L, 5), ENSURE_REQUIREMENTS, [plan_id=313], [shuffle_writer_type=hash], [OUTPUT] ArrayBuffer(n_regionkey:LongType, sum:LongType, count:LongType)
            +- ^(5) HashAggregateTransformer(keys=[n_regionkey#2L], functions=[merge_sum(n_nationkey#0L), partial_count(distinct n_name#1)], isStreamingAgg=false)
               +- ^(5) HashAggregateTransformer(keys=[n_regionkey#2L, n_name#1], functions=[merge_sum(n_nationkey#0L)], isStreamingAgg=false)
                  +- ^(5) InputIteratorTransformer[n_regionkey#2L, n_name#1, sum#35L]
                     +- ColumnarExchange hashpartitioning(n_regionkey#2L, n_name#1, 5), ENSURE_REQUIREMENTS, [plan_id=307], [shuffle_writer_type=hash], [OUTPUT] ArrayBuffer(n_regionkey:LongType, n_name:StringType, sum:LongType)
                        +- ^(4) HashAggregateTransformer(keys=[n_regionkey#2L, n_name#1], functions=[partial_sum(n_nationkey#0L)], isStreamingAgg=false)
                           +- ^(4) FileScanTransformer parquet tpch_pq.nation[n_nationkey#0L,n_name#1,n_regionkey#2L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file://...., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<n_nationkey:bigint,n_name:string,n_regionkey:bigint>
   
    |
   +----------------------------------------------------+
   ```
   
   There are two aggregation steps here. The first step groups by (`n_regionkey`, `n_name`) to make `n_name` unique. If `n_regionkey` or `n_name` has high cardinality, the two operations that follow are expected to be expensive.
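
To make the two steps concrete, here is a minimal Python sketch of the same logic outside of Spark. It is only an illustration: the sample rows and the function name `two_step_aggregate` are assumptions, not part of Gluten or the plan above.

```python
from collections import defaultdict

# Hypothetical sample rows: (n_nationkey, n_name, n_regionkey)
rows = [
    (0, "ALGERIA", 0),
    (1, "ARGENTINA", 1),
    (2, "BRAZIL", 1),
    (3, "BRAZIL", 1),  # duplicate n_name within region 1
    (4, "EGYPT", 0),
]


def two_step_aggregate(rows):
    # Step 1: group by (n_regionkey, n_name) so that each n_name
    # appears at most once per region, carrying a partial sum along.
    step1 = defaultdict(int)
    for nationkey, name, regionkey in rows:
        step1[(regionkey, name)] += nationkey

    # Step 2: group by n_regionkey only. Merging the partial sums gives
    # sum(n_nationkey); counting the (already unique) names gives
    # count(distinct n_name). NULL names are not counted.
    result = defaultdict(lambda: [0, 0])
    for (regionkey, name), partial_sum in step1.items():
        result[regionkey][0] += partial_sum
        if name is not None:
            result[regionkey][1] += 1
    return {key: tuple(value) for key, value in result.items()}


print(two_step_aggregate(rows))
```

The intermediate map in step 1 has one entry per distinct (`n_regionkey`, `n_name`) pair, which is why high cardinality in either column makes this path costly.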
   
   
   We cannot simply use `count_distinct` to merge the two aggregation steps into one, because data skew could cause an OOM.
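
The skew concern can be sketched as follows. This assumes that a single-step distinct count keeps one in-memory set of seen values per grouping key; that is an assumption about how such an aggregate typically works, not a description of the actual ClickHouse backend. The function name `one_step_aggregate` and the skewed sample data are hypothetical.

```python
from collections import defaultdict


def one_step_aggregate(rows):
    # Single pass grouped by n_regionkey only. Each group must remember
    # every distinct n_name it has seen, so its state grows with the
    # number of distinct values in that group.
    sums = defaultdict(int)
    distinct_names = defaultdict(set)
    for nationkey, name, regionkey in rows:
        sums[regionkey] += nationkey
        if name is not None:
            distinct_names[regionkey].add(name)
    result = {key: (sums[key], len(distinct_names[key])) for key in sums}
    largest_state = max((len(s) for s in distinct_names.values()), default=0)
    return result, largest_state


# Hypothetical skewed input: region 0 holds 1000 distinct names, region 1 holds one.
skewed = [(i, f"name_{i}", 0) for i in range(1000)] + [(1000, "only", 1)]
result, largest_state = one_step_aggregate(skewed)
print(result[0], result[1], largest_state)
```

In this sketch almost all of the state sits in the group for region 0. After a shuffle on `n_regionkey` alone, that whole set would land on a single task, which is where the memory pressure would come from.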

