Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2305#discussion_r17765356
  
    --- Diff: docs/programming-guide.md ---
    @@ -286,7 +286,7 @@ We describe operations on distributed datasets later on.
     
     </div>
     
    -One important parameter for parallel collections is the number of *slices* 
to cut the dataset into. Spark will run one task for each slice of the cluster. 
Typically you want 2-4 slices for each CPU in your cluster. Normally, Spark 
tries to set the number of slices automatically based on your cluster. However, 
you can also set it manually by passing it as a second parameter to 
`parallelize` (e.g. `sc.parallelize(data, 10)`).
    +One important parameter for parallel collections is the number of 
*partitions* to cut the dataset into. Spark will run one task for each 
partition of the cluster. Typically you want 2-4 partitions for each CPU in 
your cluster. Normally, Spark tries to set the number of partitions 
automatically based on your cluster. However, you can also set it manually by 
passing it as a second parameter to `parallelize` (e.g. `sc.parallelize(data, 
10)`). Note: the parameter is called numSlices (not numPartitions) to maintain 
backward compatibility.
    --- End diff --
    
    Maybe the "Note:" should mention that in _some_ places we still say 
`numSlices` (for backwards compatibility with earlier versions of Spark) and 
that "slices" should be treated as a synonym for "partitions". There are a lot 
of places that use `numPartitions`, etc., so we may want to emphasize that this 
discrepancy only occurs in a few places.

