[
https://issues.apache.org/jira/browse/CALCITE-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996956#comment-16996956
]
Danny Chen commented on CALCITE-3594:
-------------------------------------
Assume you have grouping keys: k1, k2 with hot keys:
k1: a, b, c
k2: d, e, f
You could write a hint statement like
{code:java}
select /*+ AGG_HOT_KEY(k1='a,c,c', k2='d,e,f') */ ...
{code}
- You should add a HintStrategy into the HintStrategies
- You should not add a hint syntax after the group by keyword, we always try to
keep the hint statement either after the SELECT keyword or table reference, i
don't want the hint syntax to be a mess(appears everywhere in a query
statement), that is why we did a lot of effort to support hints propagation
- There is already an issue CALCITE-3590 to support Aggregate node hint
- I'm confused with your requirements somehow, why you need multiple RelHint
for a single group key, do you mean the hint item ?
> Support hot Groupby keys hint
> -----------------------------
>
> Key: CALCITE-3594
> URL: https://issues.apache.org/jira/browse/CALCITE-3594
> Project: Calcite
> Issue Type: Sub-task
> Reporter: Rui Wang
> Assignee: Rui Wang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> It will be useful for Apache Beam if we support the following SqlHint:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */)
> The hot key strategy works on aggregation and it provides a list of hot keys
> with fanout factor for a column. The fanout factor says how many partition
> should be created for that specific key, such that we can have a per
> partition aggregate and then have a final aggregate. One example to explain
> it:
> SELECT * FROM t
> GROUP BY t.key_column /* + hot_key("value1"=2) */)
> // for the key_column, there is a "value1" which appear so many times (so
> it's hot), please consider split it into two partition and process separately.
> Such problem is common for big data processing, where hot key creates slowest
> machine which either slow down the whole pipeline or make retries. In such
> case, one common resolution is to split data to multiple partition and
> aggregate per partition, and then have a final combine.
> Usually execution engine won't know what is the hot key(s). SqlHint provides
> a good way to tell the engine which key is useful to deal with it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)