Github user fhueske commented on the pull request:

    https://github.com/apache/flink/pull/1255#issuecomment-165732606
  
    Range partitioning serves two purposes:
    
    1. producing fully sorted results.
    2. evenly balancing the load in case of skewed key distributions.
    
    Producing sorted results works fine. Producing balanced partitions, 
however, does not work as well. In the numbers I posted, the partitions 
produced by the range partitioner are less balanced than the hash-partitioned 
ones (records-in / bytes-in). The difference is not huge, but range 
partitioning should still be able to do better than hash partitioning.
    
    I proposed increasing the sample size because this should improve the 
accuracy of the histogram without a measurable impact on performance. If we 
spend that much time generating a histogram, it should be accurate enough to 
yield balanced partitions.
    
    Can you explain how you calculated the sample size of `parallelism * 20`? 
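
    To make the sample-size question concrete, here is a rough, standalone 
sketch of how one could compare partition balance for different sample sizes. 
It is not Flink's implementation: the function names, the exponential key 
distribution, the plain random sample, and the factors 20 / 200 / 2000 are all 
assumptions chosen for illustration. Boundaries are taken as equi-depth 
quantiles of the sample, and balance is reported as the largest partition 
relative to a perfectly even split.

```python
import bisect
import random


def range_boundaries(sample, parallelism):
    """Pick parallelism - 1 split points as equi-depth quantiles of the sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // parallelism] for i in range(1, parallelism)]


def partition_sizes(keys, boundaries):
    """Count how many keys fall into each range partition."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        # Keys equal to a boundary go to the partition on its right.
        sizes[bisect.bisect_right(boundaries, key)] += 1
    return sizes


def imbalance(sizes):
    """Largest partition relative to a perfectly even split (1.0 is ideal)."""
    return max(sizes) / (sum(sizes) / len(sizes))


def run(parallelism=4, num_records=100_000, seed=42):
    rng = random.Random(seed)
    # Hypothetical skewed key distribution, used only for this illustration.
    keys = [rng.expovariate(1.0) for _ in range(num_records)]
    results = {}
    for factor in (20, 200, 2000):
        # Simple random sample without replacement; an assumption, not Flink's sampler.
        sample = rng.sample(keys, parallelism * factor)
        sizes = partition_sizes(keys, range_boundaries(sample, parallelism))
        results[factor] = (sizes, imbalance(sizes))
    return results


if __name__ == "__main__":
    for factor, (sizes, ratio) in run().items():
        print(f"sample = parallelism * {factor}: sizes={sizes} imbalance={ratio:.3f}")
```

    Running it prints the partition sizes and the imbalance ratio per sample 
size, which can then be set against the records-in numbers of a hash 
partitioned run.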

