Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/329#issuecomment-39637679
  
    Hi @bijaybisht, thanks for fixing this :) I had noticed this issue once 
before, but ultimately decided not to change it. My reasons are:
    
    1.  Although not specified in the ScalaDoc, the `numSlices` parameter of 
`SparkContext.parallelize` specifies the *exact* number of partitions of the 
resulting RDD. This PR actually changes the semantics of the `numSlices` 
parameter.
    1.  An RDD *can* have more partitions than elements. For example, 
`RDD.filter` may produce empty partitions.
    1.  For APIs like `RDD.zipPartitions`, the partition count is significant, 
and this change may break existing code. For example:
    
        ```scala
        // Using coalesce to ensure we have exactly 4 partitions
        val x = sc.textFile("input", 4).coalesce(4) 
        val y = sc.parallelize(1 to 3, 4)
        val z = x.zipPartitions(y) { (i, j) =>
          ...
        }
        ```
    
        (`x.zipPartitions(y)` requires that `x` and `y` have exactly the same 
number of partitions.)
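
        A minimal sketch of points 1 and 2 under the current semantics 
(assuming a Spark shell, where `sc` is the usual `SparkContext`):

        ```scala
        // Point 1: parallelize creates exactly numSlices partitions,
        // even when there are fewer elements than slices.
        val r = sc.parallelize(1 to 3, 10)
        r.partitions.length  // 10, not capped at the element count

        // Point 2: filter preserves the partition count, so some
        // partitions end up empty.
        val f = r.filter(_ > 2)
        f.partitions.length  // still 10, most partitions now empty
        ```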

