[
https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015676#comment-17015676
]
Rui Wang commented on CALCITE-3594:
-----------------------------------
[~danny0405] Sorry for late response to this Jira. I wanted to focus on TUMBLE
TVF work so delayed this work a bit (and thanks for all your code review there!)
Now let me go back to discussion and answer your questions:
to 1: say `key1` and `key2` have real values and they are not just literal. For
example, `select * from table group by key1 and key2`. In this query, key1 will
indicate a column with its values, and same for key2. (and a few keys in key1
and key2 columns could be hot keys).
to 2: That was my mistake. Basically, for each value (hot key) of a grouping
key, we need specify a parameter for it. The value, of source, is the type of
the column, and the parameter is a integer. Use an example:
here is a table T with a binary column col_1 and varchar column col_2
||col_1||col_2||
|x123|"data"|
|x124|"data1"|
|x123|"data2"|
|x125|"data3"|
then let's have a query to do select max(col_2) from T group by col_1, and
assume x123 is the hot value (or call it hot key) in col_1. I will want to
specify a hint item to say: hey, for col_1, there is a hot value x123, with a
fanout parameter 12. That's why I need to specify a column name first, followed
by a pair, in which the key is a column value and value is a integer as a
parameter.
> Support hot Groupby keys hint
> -----------------------------
>
> Key: CALCITE-3594
> URL: https://issues.apache.org/jira/browse/CALCITE-3594
> Project: Calcite
> Issue Type: Sub-task
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> It will be useful for Apache Beam if we support the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */)
> The hot key strategy works on aggregation and it provides a list of hot keys
> with fanout factor for a column. The fanout factor says how many partition
> should be created for that specific key, such that we can have a per
> partition aggregate and then have a final aggregate. One example to explain
> it:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key("value1"=2) */)
> // for the key_column, there is a "value1" which appear so many times (so
> it's hot), please consider split it into two partition and process separately.
> Such problem is common for big data processing, where hot key creates slowest
> machine which either slow down the whole pipeline or make retries. In such
> case, one common resolution is to split data to multiple partition and
> aggregate per partition, and then have a final combine.
> Usually execution engine won't know what is the hot key(s). SqlHint provides
> a good way to tell the engine which key is useful to deal with it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)