[
https://issues.apache.org/jira/browse/FLINK-12786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-12786:
-----------------------------------
Labels: stale-assigned (was: )
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue is assigned but has not
received an update in 14, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a
comment updating the community on your progress. If this issue is waiting on
feedback, please consider this a reminder to the committer/reviewer. Flink is a
very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone
else may work on it. If the "warning_label" label is not removed in 7 days, the
issue will be automatically unassigned.
> Implement local aggregation in Flink
> ------------------------------------
>
> Key: FLINK-12786
> URL: https://issues.apache.org/jira/browse/FLINK-12786
> Project: Flink
> Issue Type: New Feature
> Components: API / DataStream
> Reporter: vinoyang
> Assignee: vinoyang
> Priority: Major
> Labels: stale-assigned
>
> Currently, keyed streams are widely used to perform aggregating operations
> (e.g., reduce, sum and window) on the elements that have the same key. When
> executed at runtime, the elements with the same key will be sent to and
> aggregated by the same task.
>
> The performance of these aggregating operations is very sensitive to the
> distribution of keys. In the cases where the distribution of keys follows a
> powerful law, the performance will be significantly downgraded. More
> unluckily, increasing the degree of parallelism does not help when a task is
> overloaded by a single key.
>
> Local aggregation is a widely-adopted method to reduce the performance
> degraded by data skew. We can decompose the aggregating operations into two
> phases. In the first phase, we aggregate the elements of the same key at the
> sender side to obtain partial results. Then at the second phase, these
> partial results are sent to receivers according to their keys and are
> combined to obtain the final result. Since the number of partial results
> received by each receiver is limited by the number of senders, the imbalance
> among receivers can be reduced. Besides, by reducing the amount of
> transferred data the performance can be further improved.
> The design documentation is here:
> [https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing]
> The discussion thread is here:
> [http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CCAA_=o7dvtv8zjcxknxyoyy7y_ktvgexrvb4zhxjwzuhsulz...@mail.gmail.com%3E]
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)