[ 
https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996973#comment-16996973
 ] 

Rui Wang commented on CALCITE-3594:
-----------------------------------

[~danny0405]

I see. To cover my hot-key requirement, I would adjust your example a bit:

select /*+ AGG_HOT_KEY(k1='a:10,b:2,c:7', k2='d:7,e:2,f:9') */

So each key value needs an additional factor that indicates how hot it is 
(the factor is used as the number of partitions that process the associated 
data). With the existing identifier:string support, I would have to encode that 
information in the string and split on a separator, e.g. ":". That is why I 
used a KV<Literal:Literal> in my PR, which is much clearer, and it is also why 
I put the hints in the GROUP BY clause. If you prefer to put hints in SELECT, 
can we support another structure in the hint syntax, such as 
/*+ AGG_HOT_KEY(k1=[a:10, b:2, c:7]) */?
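
To make the string-encoding option concrete, here is a minimal sketch of the splitting that a consumer of the hint would have to do for an option value such as 'a:10,b:2'. The function name parse_hot_keys is hypothetical and is not part of Calcite or of the PR. The sketch also shows the weakness of this encoding: a key value that itself contains "," or ":" cannot be represented.

```python
from typing import Dict


def parse_hot_keys(option: str) -> Dict[str, int]:
    """Parse a string such as 'a:10,b:2' into {'a': 10, 'b': 2}.

    Hypothetical helper for illustration only. Each entry is
    'value:fanout'; entries are separated by commas.
    """
    fanouts: Dict[str, int] = {}
    for entry in option.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        # Split on the last ':' so the fanout is always the final field.
        value, sep, factor = entry.rpartition(":")
        if not sep or not value:
            raise ValueError(f"expected 'value:fanout', got {entry!r}")
        fanouts[value] = int(factor)  # raises ValueError if not an integer
    return fanouts
```

A structured form such as k1=[a:10, b:2] would move this parsing into the SQL parser and avoid the separator problem.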

As for CALCITE-3590, I am not sure how that JIRA is progressing, since nobody 
is assigned and no PR is ready. Maybe this JIRA and that one can be worked on 
together somehow?

Lastly, allowing multiple RelHints for a single GROUP BY key was just an 
attempt at a general design. For example, I imagined that a mix of hint 
strategies could be used for a single key: "group by /*+ AGG_HOT_KEY(), 
AGG_HINT_2() */ key_column".

> Support hot Groupby keys hint
> -----------------------------
>
>                 Key: CALCITE-3594
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3594
>             Project: Calcite
>          Issue Type: Sub-task
>            Reporter: Rui Wang
>            Assignee: Rui Wang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It would be useful for Apache Beam if we supported the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /*+ hot_key(key1=fanout_factor, ...) */
> The hot key strategy applies to aggregation: it provides a list of hot keys, 
> each with a fanout factor, for a column. The fanout factor says how many 
> partitions should be created for that specific key, so that we can compute a 
> per-partition aggregate and then a final aggregate. One example to explain 
> it:
> SELECT * FROM t
> GROUP BY t.key_column /*+ hot_key("value1"=2) */
> // In key_column, the value "value1" appears so many times (so it's hot) 
> // that it should be split into two partitions that are processed separately.
> This problem is common in big data processing, where a hot key creates a 
> slowest machine that either slows down the whole pipeline or causes retries. 
> In such cases, one common resolution is to split the data into multiple 
> partitions, aggregate per partition, and then do a final combine. 
> Usually the execution engine won't know which keys are hot. SqlHint provides 
> a good way to tell the engine which keys are hot so that it can deal with them.
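
The two-phase aggregation described in the quoted issue can be sketched as follows. This is only an illustration under stated assumptions, not how Calcite or Beam implement it: the function names partial_aggregate and final_combine are hypothetical, the aggregate is a plain sum (any associative and commutative combine would work the same way), and rows for a hot key are spread over its partitions at random.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def partial_aggregate(
    rows: Iterable[Tuple[str, int]],
    fanouts: Dict[str, int],
    rng: Optional[random.Random] = None,
) -> Dict[Tuple[str, int], int]:
    """Phase 1: sum per (key, partition).

    A key listed in `fanouts` is spread over that many partitions;
    every other key uses a single partition (index 0).
    """
    rng = rng or random.Random()
    partial: Dict[Tuple[str, int], int] = defaultdict(int)
    for key, amount in rows:
        partitions = fanouts.get(key, 1)
        partition = rng.randrange(partitions)  # 0 <= partition < partitions
        partial[(key, partition)] += amount
    return dict(partial)


def final_combine(partial: Dict[Tuple[str, int], int]) -> Dict[str, int]:
    """Phase 2: merge the per-partition sums back into one sum per key."""
    final: Dict[str, int] = defaultdict(int)
    for (key, _partition), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)
```

With fanouts = {"value1": 2}, the rows for "value1" are summed in at most two partitions before the final combine, while all other keys go through a single partition as usual.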



