Github user fhueske commented on the pull request:
https://github.com/apache/flink/pull/1255#issuecomment-165732606
Range partitioning serves two purposes:
1. producing fully sorted results.
2. evenly balancing the load in case of skewed key distributions.
Producing sorted results is working fine. However, producing balanced
partitions does not seem to work so well. Looking at the numbers I posted, the
partitions produced by the range partitioner are less balanced than the hash
partitioned ones (records-in / bytes-in). The difference is not huge, but still
range partitioning should be able to do better than hash partitioning.
I proposed to increase the sample size, because this should improve the
accuracy of the histogram without having a (measurable) impact on the
performance. If we pay so much time to generate a histogram, the histogram
should be accurate enough to result in balanced partitions.
Can you explain how you calculated the sample size of `parallelism * 20`?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---