GitHub user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
Thanks for the prompt feedback. We found this opportunity when profiling
Spark 1.6.2 with HiBench large, where this again showed up as a hot method in
the PageRank benchmark. We can gather data to see if it's still hot with Spark
2 as well, and I'm planning to contribute lots of similar improvements.
Paraphrasing from a colleague:
> This data structure is the backing data structure used by RDDs that are
doing group by operations (we saw it from a PairRDD doing a groupByKey in
PageRank)
>
> The downside of the existing implementation is that every method in this
class has an if ... else if ... else ... chain that handles element 0, element 1
and then everything else, respectively.
>
> We found that on PageRank this change provides a throughput boost of
around 5% and costs us about 1 MB of estimated RDD size (86.5 MB to just under
88 MB).
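
To make the branching concrete, here is a minimal sketch in Java of the pattern described in the quote. It is an illustration only, not the actual Spark class: the name `SmallBuffer`, its fields, and the initial spill capacity of 8 are all assumptions made for the example.

```java
// Hypothetical sketch, not Spark's actual class: an append-only buffer that
// keeps its first two elements in plain fields and spills the rest to an array.
final class SmallBuffer<T> {
    private T element0;
    private T element1;
    private Object[] rest;   // allocated lazily, only from the third element onward
    private int size;

    void add(T value) {
        // Every mutating call pays for this three-way branch.
        if (size == 0) {
            element0 = value;
        } else if (size == 1) {
            element1 = value;
        } else {
            int idx = size - 2;
            if (rest == null) {
                rest = new Object[8];            // assumed initial capacity
            } else if (idx == rest.length) {
                Object[] grown = new Object[rest.length * 2];
                System.arraycopy(rest, 0, grown, 0, rest.length);
                rest = grown;
            }
            rest[idx] = value;
        }
        size++;
    }

    @SuppressWarnings("unchecked")
    T get(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        // The same element 0 / element 1 / everything else split on reads.
        if (i == 0) {
            return element0;
        } else if (i == 1) {
            return element1;
        } else {
            return (T) rest[i - 2];
        }
    }

    int size() {
        return size;
    }
}
```

In a layout like this, groups of one or two values never allocate an array, which keeps the footprint small, but every access goes through the branch chain. That is the kind of trade-off between footprint and throughput reflected in the numbers above.
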
Note that in our testing with OpenJDK 8 we didn't see a noticeable
performance improvement (nor a regression) despite the very slight footprint
increase (2 MB instead of 1.5 MB). Ideally we'll improve performance for
everybody, so there may be scope for optimisations here that'll also be of
use to OpenJDK users.