[
https://issues.apache.org/jira/browse/FLINK-12786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-12786:
-----------------------------------
Labels: auto-deprioritized-major auto-deprioritized-minor auto-unassigned
(was: auto-deprioritized-major auto-unassigned stale-minor)
Priority: Not a Priority (was: Minor)
This issue was labeled "stale-minor" 7 days ago and has not received any
updates so it is being deprioritized. If this ticket is actually Minor, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> Implement local aggregation in Flink
> ------------------------------------
>
> Key: FLINK-12786
> URL: https://issues.apache.org/jira/browse/FLINK-12786
> Project: Flink
> Issue Type: New Feature
> Components: API / DataStream
> Reporter: vinoyang
> Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor,
> auto-unassigned
>
> Currently, keyed streams are widely used to perform aggregating operations
> (e.g., reduce, sum and window) on the elements that have the same key. When
> executed at runtime, the elements with the same key will be sent to and
> aggregated by the same task.
>
> The performance of these aggregating operations is very sensitive to the
> distribution of keys. In the cases where the distribution of keys follows a
> powerful law, the performance will be significantly downgraded. More
> unluckily, increasing the degree of parallelism does not help when a task is
> overloaded by a single key.
>
> Local aggregation is a widely-adopted method to reduce the performance
> degraded by data skew. We can decompose the aggregating operations into two
> phases. In the first phase, we aggregate the elements of the same key at the
> sender side to obtain partial results. Then at the second phase, these
> partial results are sent to receivers according to their keys and are
> combined to obtain the final result. Since the number of partial results
> received by each receiver is limited by the number of senders, the imbalance
> among receivers can be reduced. Besides, by reducing the amount of
> transferred data the performance can be further improved.
> The design documentation is here:
> [https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing]
> The discussion thread is here:
> [http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CCAA_=o7dvtv8zjcxknxyoyy7y_ktvgexrvb4zhxjwzuhsulz...@mail.gmail.com%3E]
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)