[ 
https://issues.apache.org/jira/browse/FLINK-12786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16873162#comment-16873162
 ] 

vinoyang commented on FLINK-12786:
----------------------------------

Now, the main discussers in the community are me and the developers from 
Alibaba. It seems that we have major differences at the API level. We hope that 
[~aljoscha] and [~StephanEwen] can give more professional advice. Maybe just 
give directions. It seems that the current discussion will be inefficient. This 
wastes a lot of time. Maybe a lot of work could have been done in parallel.

> Implement local aggregation in Flink
> ------------------------------------
>
>                 Key: FLINK-12786
>                 URL: https://issues.apache.org/jira/browse/FLINK-12786
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / DataStream
>            Reporter: vinoyang
>            Assignee: vinoyang
>            Priority: Major
>
> Currently, keyed streams are widely used to perform aggregating operations 
> (e.g., reduce, sum and window) on the elements that have the same key. When 
> executed at runtime, the elements with the same key will be sent to and 
> aggregated by the same task.
>  
> The performance of these aggregating operations is very sensitive to the 
> distribution of keys. In the cases where the distribution of keys follows a 
> powerful law, the performance will be significantly downgraded. More 
> unluckily, increasing the degree of parallelism does not help when a task is 
> overloaded by a single key.
>  
> Local aggregation is a widely-adopted method to reduce the performance 
> degraded by data skew. We can decompose the aggregating operations into two 
> phases. In the first phase, we aggregate the elements of the same key at the 
> sender side to obtain partial results. Then at the second phase, these 
> partial results are sent to receivers according to their keys and are 
> combined to obtain the final result. Since the number of partial results 
> received by each receiver is limited by the number of senders, the imbalance 
> among receivers can be reduced. Besides, by reducing the amount of 
> transferred data the performance can be further improved.
> The design documentation is here: 
> [https://docs.google.com/document/d/1gizbbFPVtkPZPRS8AIuH8596BmgkfEa7NRwR6n3pQes/edit?usp=sharing]
> The discussion thread is here: 
> [http://mail-archives.apache.org/mod_mbox/flink-dev/201906.mbox/%3CCAA_=o7dvtv8zjcxknxyoyy7y_ktvgexrvb4zhxjwzuhsulz...@mail.gmail.com%3E]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to