Aljoscha Krettek commented on FLINK-1725:

I think this issue is a bit more complicated than it looks on the surface: 
Flink uses the partitioner to both partition state and incoming elements. It is 
a strict requirement that this partitioning is always the same, especially 
across version numbers and when restoring save points with different 
parallelism, which causes state to be reshuffled according to the partitioner.

Simply adding a new partitioner is easy but if you cannot use keyed state, what 
do you use it for? And if you don't have keyed state, then a simply round-robin 
or random reshuffle will do the trick.

What do you think, [~srichter]?

> New Partitioner for better load balancing for skewed data
> ---------------------------------------------------------
>                 Key: FLINK-1725
>                 URL: https://issues.apache.org/jira/browse/FLINK-1725
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataStream API
>    Affects Versions: 0.8.1
>            Reporter: Anis Nasir
>            Assignee: Anis Nasir
>              Labels: LoadBalancing, Partitioner
>   Original Estimate: 336h
>  Remaining Estimate: 336h
> Hi,
> We have recently studied the problem of load balancing in Storm [1].
> In particular, we focused on key distribution of the stream for skewed data.
> We developed a new stream partitioning scheme (which we call Partial Key 
> Grouping). It achieves better load balancing than key grouping while being 
> more scalable than shuffle grouping in terms of memory.
> In the paper we show a number of mining algorithms that are easy to implement 
> with partial key grouping, and whose performance can benefit from it. We 
> think that it might also be useful for a larger class of algorithms.
> Partial key grouping is very easy to implement: it requires just a few lines 
> of code in Java when implemented as a custom grouping in Storm [2].
> For all these reasons, we believe it will be a nice addition to the standard 
> Partitioners available in Flink. If the community thinks it's a good idea, we 
> will be happy to offer support in the porting.
> References:
> [1]. 
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> [2]. https://github.com/gdfm/partial-key-grouping

This message was sent by Atlassian JIRA

Reply via email to