GitHub user ash211 commented on the pull request:
https://github.com/apache/spark/pull/2117#issuecomment-54596140
@nchammas I'm guessing your OOM issue is unrelated to this one.
```
# Assumes a PySpark shell, where sc is the SparkContext.
a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)) \
 .reduceByKey(lambda x, y: x + y, sc.defaultParallelism) \
 .take(1)
```
After the reduceByKey above, you'd have 24000 partitions and only 2 entries
across them: (4, "NickJohn") and (3, "Bob"). This bug manifests when partition
0 is empty and many of the remaining partitions each hold a large amount of
data: .take(n) fetches up to the first n elements from each remaining
partition and then takes the first n from the concatenation of those arrays.
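To make that mechanism concrete, here's a minimal pure-Python sketch of the
fallback behavior described above (a simplification, not Spark's actual
implementation; the partition layout is made up for illustration):
```
# Simplified sketch of take(n) falling back when the first partition is
# empty: pull up to n elements from every remaining partition, then truncate.
def take_sketch(partitions, n):
    buf = list(partitions[0][:n])
    if len(buf) >= n:
        return buf
    # Fallback: fetch up to n elements from each remaining partition into
    # driver memory, then keep only the first n of the combined result.
    for part in partitions[1:]:
        buf.extend(part[:n])
    return buf[:n]

# 24000 partitions, partition 0 empty, two non-empty partitions (indices
# chosen arbitrarily): the driver materializes one mostly-empty list per
# partition, harmless here but expensive when each partition holds a large
# amount of data.
partitions = [[] for _ in range(24000)]
partitions[7] = [(4, "NickJohn")]
partitions[13] = [(3, "Bob")]
print(take_sketch(partitions, 1))  # [(4, 'NickJohn')]
```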
For this bug to affect your situation you'd need an empty first partition
(a good 23998/24000 chance). The driver would then bring into memory 23998
empty arrays and 2 arrays of size 1 (or maybe 1 array of size 2), which I
can't imagine would OOM the driver. So I don't think this is your bug.
The other evidence is that you observed a regression (at least per the perf
numbers later in your bug report), while this behavior has been the same for
quite some time. The current behavior was implemented in commit
42571d30d0d518e69eecf468075e4c5a823a2ae8 and was first released in version 0.9:
```
aash@aash-mbp ~/git/spark$ git log origin/branch-1.0 | grep 42571d30d0d518e69eecf468075e4c5a823a2ae8
commit 42571d30d0d518e69eecf468075e4c5a823a2ae8
aash@aash-mbp ~/git/spark$ git log origin/branch-0.9 | grep 42571d30d0d518e69eecf468075e4c5a823a2ae8
commit 42571d30d0d518e69eecf468075e4c5a823a2ae8
aash@aash-mbp ~/git/spark$ git log origin/branch-0.8 | grep 42571d30d0d518e69eecf468075e4c5a823a2ae8
aash@aash-mbp ~/git/spark$
```