GitHub user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
Performance results against the Spark master branch on a 48-core machine,
running PageRank with 500k pages, follow.
**Vanilla CompactBuffer, no changes; run time (seconds) and throughput (bytes
per second) provided**
```
ScalaSparkPagerank 2016-12-01 13:16:09 259928115 47.933 5422738
ScalaSparkPagerank 2016-12-01 13:22:41 259928115 45.551 5706309
ScalaSparkPagerank 2016-12-01 13:26:31 259928115 46.745 5560554
ScalaSparkPagerank 2016-12-01 13:28:58 259928115 51.699 5027720
ScalaSparkPagerank 2016-12-01 13:33:26 259928115 48.415 5368751
240.343s / 5 = 48.0686s avg
```
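As a sanity check on how to read these rows, the sketch below assumes the third-from-last column (259928115) is the input size in bytes and the last two columns are run time in seconds and throughput in bytes per second. Under that assumption, and assuming the reporting tool truncates to an integer, the throughput column can be recomputed from the other two:

```python
# Values copied from the vanilla CompactBuffer runs above.
# Assumption: 259928115 is the input size in bytes.
INPUT_BYTES = 259_928_115
run_times_s = [47.933, 45.551, 46.745, 51.699, 48.415]
reported_bytes_per_s = [5422738, 5706309, 5560554, 5027720, 5368751]


def throughput(input_bytes, seconds):
    """Bytes processed per second, truncated to an integer (assumed rounding)."""
    return int(input_bytes / seconds)


recomputed = [throughput(INPUT_BYTES, t) for t in run_times_s]

for t, got, rep in zip(run_times_s, recomputed, reported_bytes_per_s):
    print(f"{t:7.3f}s  recomputed={got}  reported={rep}")
```

If the recomputed and reported values agree, throughput carries no information beyond run time for a fixed input, so comparing average run times is sufficient.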
**The commit here**
```
ScalaSparkPagerank 2016-12-01 10:26:12 259928115 48.706 5336675
ScalaSparkPagerank 2016-12-01 10:37:30 259928115 48.947 5310399
ScalaSparkPagerank 2016-12-01 10:40:16 259928115 49.768 5222796
ScalaSparkPagerank 2016-12-01 12:55:37 259928115 48.873 5318439
ScalaSparkPagerank 2016-12-01 12:58:12 259928115 47.535 5468141
243.829s / 5 = 48.7658s avg
```
These results are too close to call, so I'm attributing the difference to
benchmark noise. Without the 51s vanilla run, though, this commit would be a
few percentage points worse (about 3.4% slower on average).
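The comparison between the vanilla runs and this commit can be reproduced with a short script. This is only a sketch over the ten run times listed above; the 51-second cutoff used to drop the slow vanilla run is an illustrative choice, not part of the benchmark.

```python
from statistics import mean, stdev

# Run times in seconds, copied from the two result blocks above.
vanilla = [47.933, 45.551, 46.745, 51.699, 48.415]
commit = [48.706, 48.947, 49.768, 48.873, 47.535]


def pct_slower(candidate_s, baseline_s):
    """How much slower the candidate is than the baseline, in percent."""
    return (candidate_s / baseline_s - 1.0) * 100.0


# Comparison using all five runs on each side.
all_runs = pct_slower(mean(commit), mean(vanilla))

# Same comparison after dropping the slow ~51s vanilla run
# (the 51.0 threshold is a hypothetical cutoff for illustration).
vanilla_trimmed = [t for t in vanilla if t < 51.0]
trimmed = pct_slower(mean(commit), mean(vanilla_trimmed))

print(f"all runs: {all_runs:.1f}% slower")
print(f"without the 51s run: {trimmed:.1f}% slower")
print(f"vanilla run-to-run stdev: {stdev(vanilla):.2f}s")
```

With these numbers the gap is roughly 1.5% over all runs and roughly 3.4% without the slow run, while the spread across the vanilla runs alone is a little over 2 seconds, which is larger than the difference between the two averages.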
**Use an ArrayBuffer (initial capacity of 16, default) instead of
CompactBuffer**
```
ScalaSparkPagerank 2016-12-01 13:42:45 259928115 62.190 4179580
ScalaSparkPagerank 2016-12-01 13:55:20 259928115 54.112 4803520
ScalaSparkPagerank 2016-12-01 13:59:06 259928115 60.818 4273868
ScalaSparkPagerank 2016-12-01 14:06:26 259928115 57.428 4526156
ScalaSparkPagerank 2016-12-01 14:35:01 259928115 58.218 4464737
292.766s / 5 = 58.5532s avg
```
**Use an ArrayBuffer (initial capacity of 2) instead of CompactBuffer**
```
ScalaSparkPagerank 2016-12-01 15:31:16 259928115 53.544 4854476
ScalaSparkPagerank 2016-12-01 15:36:32 259928115 58.105 4473420
ScalaSparkPagerank 2016-12-01 15:38:45 259928115 53.976 4815623
ScalaSparkPagerank 2016-12-01 15:44:09 259928115 55.174 4711061
ScalaSparkPagerank 2016-12-01 15:50:01 259928115 55.084 4718758
275.883s / 5 = 55.1766s avg
```
In my tests, using an ArrayBuffer is noticeably worse (roughly 15% to 22%
slower on average than CompactBuffer), so I'll keep looking into what's going
on and whether we can improve performance here, as this is definitely a hot
code path for this particular algorithm.
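To put the ArrayBuffer results on the same scale, the sketch below computes each variant's average run time and its slowdown relative to the vanilla CompactBuffer average. The dictionary keys are just labels chosen for this illustration; the run times are the ones reported above.

```python
from statistics import mean

# Run times in seconds, copied from the result blocks above.
runs_s = {
    "CompactBuffer (vanilla)": [47.933, 45.551, 46.745, 51.699, 48.415],
    "ArrayBuffer, capacity 16": [62.190, 54.112, 60.818, 57.428, 58.218],
    "ArrayBuffer, capacity 2": [53.544, 58.105, 53.976, 55.174, 55.084],
}

baseline_s = mean(runs_s["CompactBuffer (vanilla)"])

# Percentage slowdown of each variant's mean relative to the vanilla mean.
slowdown_pct = {
    name: (mean(times) / baseline_s - 1.0) * 100.0
    for name, times in runs_s.items()
}

for name, pct in slowdown_pct.items():
    print(f"{name}: avg {mean(runs_s[name]):.3f}s, {pct:+.1f}% vs vanilla")
```

On these figures the default-capacity ArrayBuffer comes out roughly 22% slower and the capacity-2 ArrayBuffer roughly 15% slower, both well outside the run-to-run spread seen in the vanilla runs.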