[ 
https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000404#comment-17000404
 ] 

Rui Wang commented on CALCITE-3594:
-----------------------------------

[~danny0405] thanks!

I revise a bit on you suggestion:

```
 /*+ AGG_HOT_KEY(`key1`=`a:10,b:2;c:7`, `key2`=`d:7,e:2,f:9`) */
```

Because not every key is hot so it has to be a key value pair of Identifiers. I 
opened a PR accordingly: https://github.com/apache/calcite/pull/1677


I would want to try this idea in BeamSQL and then to see if we have future need 
to revise it.

> Support hot Groupby keys hint
> -----------------------------
>
>                 Key: CALCITE-3594
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3594
>             Project: Calcite
>          Issue Type: Sub-task
>            Reporter: Rui Wang
>            Assignee: Rui Wang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> It will be useful for Apache Beam if we support the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */)
> The hot key strategy works on aggregation and it provides a list of hot keys 
> with fanout factor for a column. The fanout factor says how many partition 
> should be created for that specific key, such that we can have a per 
> partition aggregate and then have a final aggregate. One example to explain 
> it:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key("value1"=2) */)
> // for the key_column, there is a "value1" which appear so many times (so 
> it's hot), please consider split it into two partition and process separately.
> Such problem is common for big data processing, where hot key creates slowest 
> machine which either slow down the whole pipeline or make retries. In such 
> case, one common resolution is to split data to multiple partition and 
> aggregate per partition, and then have a final combine. 
> Usually execution engine won't know what is the hot key(s). SqlHint provides 
> a good way to tell the engine which key is useful to deal with it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to