Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/2117#issuecomment-54596140
  
    @nchammas I'm guessing your OOM issue is unrelated to this one.
    
    ```
    a = sc.parallelize(["Nick", "John", "Bob"])
    a = a.repartition(24000)
    a.keyBy(lambda x: len(x)).reduceByKey(lambda x, y: x + y, sc.defaultParallelism).take(1)
    ```
    
    After the reduceByKey above, you'd have 24000 partitions and only 2 entries in them: (4, "NickJohn") and (3, "Bob").  This bug manifests when you have an empty partition 0 and many remaining partitions, each with a large amount of data.  The .take(n) gets up to the first n elements from each remaining partition and then takes the first n from the concatenation of those arrays.
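    
    To make that concrete, here's a rough driver-side sketch of the behavior described above (illustrative only, not Spark's actual take() implementation):
    
    ```
    # Illustrative sketch only -- not Spark's actual take() code.
    # 'partitions' stands in for the materialized contents of each partition.
    def naive_take(partitions, n):
        buffered = list(partitions[0])
        if len(buffered) >= n:
            return buffered[:n]
        # Partition 0 didn't have enough elements, so pull up to n elements
        # from every remaining partition and concatenate them on the driver.
        for part in partitions[1:]:
            buffered.extend(part[:n])
        return buffered[:n]

    # With an empty partition 0, a slice of every other partition gets buffered:
    naive_take([[], ["NickJohn"], ["Bob"]] + [[]] * 23997, 1)  # -> ['NickJohn']
    ```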
    
    For this bug to affect your situation, you'd have to have an empty first partition (a good 23998/24000 chance).  The driver would then bring into memory 23998 empty arrays and 2 arrays of size 1 (or maybe 1 array of size 2), which I can't imagine would OOM the driver.  So I don't think this is your bug.
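    
    As a rough back-of-the-envelope check (using plain CPython list sizes as a stand-in for whatever the driver actually buffers), that result set is tiny:
    
    ```
    import sys

    # Rough estimate only; string contents and per-object overhead elsewhere
    # are not counted, but the order of magnitude is the point.
    empty = sys.getsizeof([])             # ~56-64 bytes per empty list
    small = sys.getsizeof(["NickJohn"])   # a one-element list
    print(23998 * empty + 2 * small)      # on the order of 1-2 MB total
    ```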
    
    The other evidence is that you observed a regression (at least according to the perf numbers later in your bug report), while this behavior has been the same for quite some time.  The current behavior was implemented in commit 42571d30d0d518e69eecf468075e4c5a823a2ae8 and was first released in version 0.9:
    
    ```
    aash@aash-mbp ~/git/spark$ git log origin/branch-1.0 | grep 42571d30d0d518e69eecf468075e4c5a823a2ae8
    commit 42571d30d0d518e69eecf468075e4c5a823a2ae8
    aash@aash-mbp ~/git/spark$ git log origin/branch-0.9 | grep 42571d30d0d518e69eecf468075e4c5a823a2ae8
    commit 42571d30d0d518e69eecf468075e4c5a823a2ae8
    aash@aash-mbp ~/git/spark$ git log origin/branch-0.8 | grep 42571d30d0d518e69eecf468075e4c5a823a2ae8
    aash@aash-mbp ~/git/spark$
    ```

