Github user anisnasir commented on the pull request:
https://github.com/apache/flink/pull/1069#issuecomment-135905274
@tillrohrmann you are absolutely right that under high skew the most
frequent keys require more than two workers. However, most real-world
datasets do not exhibit such high skew [1] and can be handled by splitting
each key across just two workers.
I was planning to write a wordcount example using both the HashPartitioner
and the PartialPartitioner. Could you explain in a bit more detail how one
could verify that the skew of the input data decreases after partitioning?
[1].
https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
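To make the discussion concrete, here is a minimal, self-contained sketch (not Flink API; all class and method names are illustrative assumptions) of the "power of both choices" idea from [1]: each key hashes to two candidate workers and every tuple is routed to the currently less loaded of the two. One way to check whether skew decreases is to compare the load imbalance (max worker load divided by average load) against plain hash partitioning on a skewed synthetic stream:

```java
import java.util.Random;

public class PartialPartitionerSketch {

    // Two cheap hash functions, distinguished by a seed (assumed scheme).
    static int hashMod(int key, int seed, int n) {
        return Math.floorMod(key * 31 + seed, n);
    }

    // Hash partitioning: all occurrences of a key land on one worker.
    static long[] hashPartition(int[] keys, int workers) {
        long[] load = new long[workers];
        for (int k : keys) load[hashMod(k, 17, workers)]++;
        return load;
    }

    // Partial partitioning: split each key over its two candidate workers,
    // always sending the tuple to the less loaded of the two.
    static long[] partialPartition(int[] keys, int workers) {
        long[] load = new long[workers];
        for (int k : keys) {
            int a = hashMod(k, 17, workers);
            int b = hashMod(k, 101, workers);
            if (load[a] <= load[b]) load[a]++; else load[b]++;
        }
        return load;
    }

    // Imbalance = max worker load / average load (1.0 is perfectly balanced).
    static double imbalance(long[] load) {
        long max = 0, sum = 0;
        for (long l : load) { max = Math.max(max, l); sum += l; }
        return (double) max / ((double) sum / load.length);
    }

    // Returns {hash imbalance, partial imbalance} for a skewed stream
    // where one hot key accounts for roughly 90% of the tuples.
    static double[] run(int tuples, int workers) {
        Random rnd = new Random(42);
        int[] keys = new int[tuples];
        for (int i = 0; i < tuples; i++)
            keys[i] = rnd.nextInt(10) == 0 ? rnd.nextInt(1000) : 0;
        return new double[] {
            imbalance(hashPartition(keys, workers)),
            imbalance(partialPartition(keys, workers))
        };
    }

    public static void main(String[] args) {
        double[] r = run(100_000, 8);
        System.out.println("hash imbalance:    " + r[0]);
        System.out.println("partial imbalance: " + r[1]);
    }
}
```

Note that the sketch also illustrates the limitation raised above: with one extremely hot key, partial partitioning roughly halves the maximum load but cannot do better, since the hot key is still confined to its two candidate workers.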