Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/3694#issuecomment-75580971
  
    In general, a fixed number of partitions is very difficult to work with 
when configuring a shuffle.  Suppose I have a job where I know a `flatMap` is 
going to blow up the size of my data by a factor of two.  If I want to 
minimize reduce-side spilling in the shuffle that comes after the `flatMap`, 
I want the parallelism of that shuffle to be double that of the input stage.  
Because the size of my input data can change between runs of my job, a ratio 
is a much more natural way to express this than a constant.
    
    It's unclear to me whether a global default is useful at all, but a 
configurable parallelism ratio per shuffle operation definitely is.  
(Systems like Crunch take this approach.)
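    
    To make the ratio concrete, here is a rough sketch of the arithmetic this 
amounts to doing by hand today: derive the reduce-side partition count from 
the input's partition count and an expected expansion factor. (This is only 
an illustration; the path, the `expansionRatio` value, and the `sc` 
SparkContext are assumptions, not anything defined in this PR.)

```scala
// Sketch only: scale the shuffle's parallelism with the input stage
// instead of relying on a fixed spark.default.parallelism value.
// Assumes `sc` is the usual SparkContext (e.g. from spark-shell).
val input = sc.textFile("hdfs:///example/input")  // partition count tracks input size
val expansionRatio = 2.0                          // the flatMap roughly doubles the data

val numReducers = math.max(1, (input.partitions.length * expansionRatio).toInt)

val counts = input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, numReducers)                // explicit partition count for this shuffle
```

    A per-shuffle ratio setting would let the system do this arithmetic 
itself, so the reduce-side parallelism keeps tracking the input size as it 
changes between runs.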


