GitHub user mateiz opened a pull request:
https://github.com/apache/spark/pull/1555
SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy &
cogroup
JIRA: https://issues.apache.org/jira/browse/SPARK-2657
Our current code uses an ArrayBuffer for each group of values in groupBy, as
well as for each key's elements in CoGroupedRDD. ArrayBuffers carry a lot of
overhead when they hold only a few values, which is likely to happen in cases
such as join. In particular, each one points to an Object[] of size 16 by
default, which costs 24 bytes for the array header + 128 for the pointers in
there, plus at least 32 for the ArrayBuffer object itself, so at least 184
bytes per group before counting the values. This patch replaces the per-group
buffers with a CompactBuffer class that stores up to 2 elements more
efficiently (in fields of itself) and acts like an ArrayBuffer beyond that.
For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers
instead of an ArrayBuffer of ArrayBuffers.
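
To make the idea concrete, here is a minimal sketch of a buffer that keeps its first two elements in fields and only allocates a growable buffer for a third. It is not the code in this patch: the class name `SmallBuffer`, the field names, and the choice of an `ArrayBuffer` for the overflow are all assumptions made for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch only; names and layout are assumptions, not the patch's code.
final class SmallBuffer[T] {
  // The first two elements live directly in fields, so no backing array is needed.
  private var element0: T = null.asInstanceOf[T]
  private var element1: T = null.asInstanceOf[T]
  // Allocated lazily, only when a third element arrives.
  private var overflow: ArrayBuffer[T] = null
  private var curSize = 0

  def size: Int = curSize

  def apply(i: Int): T = {
    if (i < 0 || i >= curSize) throw new IndexOutOfBoundsException(i.toString)
    if (i == 0) element0
    else if (i == 1) element1
    else overflow(i - 2)
  }

  def +=(value: T): this.type = {
    if (curSize == 0) element0 = value
    else if (curSize == 1) element1 = value
    else {
      if (overflow == null) overflow = new ArrayBuffer[T]
      overflow += value
    }
    curSize += 1
    this
  }

  // Walks the fields first, then the overflow buffer.
  def iterator: Iterator[T] = Iterator.range(0, curSize).map(i => apply(i))
}
```

With one or two values per key, which is the common case for join, such a buffer costs a single small object rather than an object plus a 16-slot array.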
There are some changes throughout the code to deal with CoGroupedRDD
returning an Array instead. We could also decide not to do that, but
CoGroupedRDD is a `@DeveloperApi`, so I think it's okay to change it here.
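
As a rough illustration of what callers have to adapt to, the sketch below consumes one key's value of the `Array[Iterable[_]]` shape named in the commit list and produces join pairs. The object and the helper `joinValues` are hypothetical and use plain Scala collections; they are not taken from the Spark code base.

```scala
object CoGroupShapeSketch {
  // Hypothetical helper: builds the join output for one key from a
  // cogroup-style array, where index 0 holds the left values and
  // index 1 holds the right values.
  def joinValues[V, W](groups: Array[Iterable[_]]): Seq[(V, W)] = {
    // The element type is erased in the array, so callers cast per position.
    val vs = groups(0).asInstanceOf[Iterable[V]]
    val ws = groups(1).asInstanceOf[Iterable[W]]
    (for (v <- vs; w <- ws) yield (v, w)).toSeq
  }
}
```

Indexing into the array directly avoids wrapping it in another collection for each key.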
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mateiz/spark compact-groupby
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1555.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1555
----
commit 10f0de1ee86563b5bec6c8f1270a8198d6449393
Author: Matei Zaharia <[email protected]>
Date: 2014-07-23T22:36:45Z
A CompactBuffer that's more memory-efficient than ArrayBuffer for small
buffers
commit ed577ab3fa50de0ed1bd21eae43013ffa6dac51c
Author: Matei Zaharia <[email protected]>
Date: 2014-07-23T22:37:31Z
Use CompactBuffer in groupByKey
commit 9b4c6e811159857c075528dab02f6c4db7688dde
Author: Matei Zaharia <[email protected]>
Date: 2014-07-23T23:05:14Z
Use CompactBuffer in CoGroupedRDD
commit 775110fa6124e090c0aeed6baf7a408be3f30f9a
Author: Matei Zaharia <[email protected]>
Date: 2014-07-23T23:17:12Z
Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
CoGroupedRDD is a @DeveloperApi but this seemed worthwhile.
commit 197cde8dccb4c7dee1c9e6e9460b221988083d9b
Author: Matei Zaharia <[email protected]>
Date: 2014-07-23T23:41:27Z
Make CompactBuffer extend Seq to make its toSeq more efficient
----