Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/876#issuecomment-44151265
  
    The current contract of `Partitioner` (though apparently it's not documented...) is that it must be idempotent: equivalent keys are always assigned to the same partition. [PairRDDFunctions#lookup](https://github.com/ash211/spark/blob/sortby/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala?pr=%2Fapache%2Fspark%2Fpull%2F369#L558) makes this assumption, for instance.
    
    It turns out this sort of balanced partitioning is useful, however, and we 
have encoded it explicitly within 
[RDD#coalesce()](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L328).
 The semantics here match Spark's assumptions about partitioners -- i.e., the 
resultant RDD has no Partitioner, so no assumption can be made about the 
colocation of keys in order to do efficient lookups/groupBys/reduceByKeys.
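    To make the contract concrete, here is a minimal standalone sketch (`ModPartitioner` and `Demo` are hypothetical names for illustration, not from this PR; Spark's real `HashPartitioner` behaves the same way): `getPartition` is a pure function of the key, so equal keys always land in the same partition, which is exactly what lets `lookup` scan a single partition rather than the whole RDD. An RDD with no `Partitioner` offers no such guarantee, so those operations fall back to examining every partition.

    ```scala
    // Hypothetical partitioner illustrating the contract: getPartition
    // depends only on the key, so equal keys always map to the same
    // partition (deterministic / idempotent).
    class ModPartitioner(val numPartitions: Int) {
      def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        // Scala's % can yield a negative result for negative hash codes,
        // so shift it back into the [0, numPartitions) range.
        if (h < 0) h + numPartitions else h
      }
    }

    object Demo {
      def main(args: Array[String]): Unit = {
        val p = new ModPartitioner(4)
        // Equal keys -> same partition, every time. This is the property
        // that lookup() relies on to probe only one partition.
        assert(p.getPartition("spark") == p.getPartition("spark"))
        // Every key maps to a valid partition index.
        assert((0 until 100).forall { i =>
          val part = p.getPartition(i)
          part >= 0 && part < p.numPartitions
        })
      }
    }
    ```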
    
    Would this sort of manual repartitioning suit your use-case? Otherwise it 
would require a rather significant overhaul to Spark's Partitioner semantics.

