[ 
https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015676#comment-17015676
 ] 

Rui Wang commented on CALCITE-3594:
-----------------------------------

[~danny0405] Sorry for late response to this Jira. I wanted to focus on TUMBLE 
TVF work so delayed this work a bit (and thanks for all your code review there!)

Now let me go back to discussion and answer your questions:

to 1: say `key1` and `key2` have real values and they are not just literal. For 
example, `select * from table group by key1 and key2`. In this query, key1 will 
indicate a column with its values, and same for key2. (and a few keys in key1 
and key2 columns could be hot keys).

to 2: That was my mistake. Basically, for each value (hot key) of a grouping 
key, we need specify a parameter for it. The value, of source, is the type of 
the column, and the parameter is a integer. Use an example:

here is a table T with a binary column col_1 and varchar column col_2

||col_1||col_2||
|x123|"data"|
|x124|"data1"|
|x123|"data2"|
|x125|"data3"|

then let's have a query to do select max(col_2) from T group by col_1, and 
assume x123 is the hot value (or call it hot key) in col_1. I will want to 
specify a hint item to say: hey, for col_1, there is a hot value x123, with a 
fanout parameter 12. That's why I need to specify a column name first, followed 
by a pair, in which the key is a column value and value is a integer as a 
parameter.



> Support hot Groupby keys hint
> -----------------------------
>
>                 Key: CALCITE-3594
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3594
>             Project: Calcite
>          Issue Type: Sub-task
>            Reporter: Rui Wang
>            Assignee: Rui Wang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> It will be useful for Apache Beam if we support the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */)
> The hot key strategy works on aggregation and it provides a list of hot keys 
> with fanout factor for a column. The fanout factor says how many partition 
> should be created for that specific key, such that we can have a per 
> partition aggregate and then have a final aggregate. One example to explain 
> it:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key("value1"=2) */)
> // for the key_column, there is a "value1" which appear so many times (so 
> it's hot), please consider split it into two partition and process separately.
> Such problem is common for big data processing, where hot key creates slowest 
> machine which either slow down the whole pipeline or make retries. In such 
> case, one common resolution is to split data to multiple partition and 
> aggregate per partition, and then have a final combine. 
> Usually execution engine won't know what is the hot key(s). SqlHint provides 
> a good way to tell the engine which key is useful to deal with it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to