Hi Martin,
At the moment Flink does not have features to mitigate data skew, such as dynamic partitioning. That would also "only" allow processing large groups as individual partitions and bundling multiple smaller groups together in other partitions; the cost of a single large group itself would not be reduced. Right now this has to be handled at the application level, and could for example be addressed by adding something like a group-cross operator...
I think your approach of emitting multiple smaller partitions from a group-reduce, reshuffling (there is a rebalance operator [1]), and applying a flat-map sounds like a good idea to me. At least, I didn't come up with a better approach ;-)

Cheers, Fabian

[1] http://flink.incubator.apache.org/docs/0.7-incubating/programming_guide.html#transformations

2014-10-28 20:53 GMT+01:00 Martin Neumann <[email protected]>:

> I have a problem with load balancing and was wondering how to deal with
> this kind of problem in Flink.
> The input is a data set of grouped IDs that I join with metadata for each
> ID. Then I need to compare each item in a group with every other item in
> that group, splitting the group into subgroups where necessary. In Flink
> this is a join followed by a group-reduce.
>
> The problem is that the groups differ a lot in size: 90% of the groups are
> done in 5 minutes, while the rest take 2 hours. To make this more
> efficient, I would need to distribute the N-to-N comparison that is
> currently done in the group-reduce function. Does anyone have an idea how
> I can do that in a simple way?
>
> My current idea is to make the group-reduce step emit computation
> partitions and then do another flat-map step for the actual computation.
> Would this solve the problem?
>
> cheers Martin
>
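[Editor's note: for concreteness, here is a rough sketch of the pattern discussed above, written against the later Java DataSet API (names in 0.7-incubating may differ slightly): a group-reduce cuts each group's N x N comparison space into bounded blocks, rebalance() spreads the blocks round-robin across workers, and a flat-map performs the actual pairwise comparison. The Item and PairBlock types, the BLOCK size, and the payload-equality match condition are illustrative assumptions, not part of the thread.]

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class SkewedPairwiseCompare {

    /** Hypothetical record: an item with its group id and a payload. */
    public static class Item {
        public long groupId;
        public String payload;
        public Item() {}
        public Item(long groupId, String payload) {
            this.groupId = groupId;
            this.payload = payload;
        }
    }

    /** One unit of work: two bounded blocks whose cross product must be compared. */
    public static class PairBlock {
        public List<Item> left = new ArrayList<>();
        public List<Item> right = new ArrayList<>();
        public PairBlock() {}
    }

    /** Max items per block side; caps each block at BLOCK * BLOCK comparisons. */
    static final int BLOCK = 100;

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Item> items = env.fromElements(
                new Item(1, "a"), new Item(1, "b"), new Item(2, "c"));

        DataSet<PairBlock> blocks = items
            .groupBy("groupId")
            // Step 1: the group-reduce compares nothing itself; it only cuts
            // the group's N x N comparison space into bounded blocks.
            .reduceGroup(new GroupReduceFunction<Item, PairBlock>() {
                @Override
                public void reduce(Iterable<Item> group, Collector<PairBlock> out) {
                    List<Item> all = new ArrayList<>();
                    for (Item i : group) {
                        all.add(i);
                    }
                    for (int a = 0; a < all.size(); a += BLOCK) {
                        for (int b = a; b < all.size(); b += BLOCK) {
                            PairBlock blk = new PairBlock();
                            blk.left = new ArrayList<>(
                                    all.subList(a, Math.min(a + BLOCK, all.size())));
                            blk.right = new ArrayList<>(
                                    all.subList(b, Math.min(b + BLOCK, all.size())));
                            out.collect(blk);
                        }
                    }
                }
            })
            // Step 2: spread the blocks round-robin over all workers so one
            // huge group no longer pins a single task slot for hours.
            .rebalance();

        // Step 3: the expensive pairwise comparison now runs on small,
        // evenly distributed blocks.
        blocks.flatMap(new FlatMapFunction<PairBlock, String>() {
            @Override
            public void flatMap(PairBlock blk, Collector<String> out) {
                for (Item l : blk.left) {
                    for (Item r : blk.right) {
                        // Placeholder comparison; l != r skips the self-pairs
                        // that occur in the diagonal blocks where left == right.
                        if (l != r && l.payload.equals(r.payload)) {
                            out.collect(l.payload + " matches " + r.payload);
                        }
                    }
                }
            }
        }).print(); // print() triggers execution in the 1.x DataSet API
    }
}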
