Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713

Performance results against the Spark master branch on a 48-core machine, running PageRank with 500k pages, follow. In each row the last two columns are the run time in seconds and the throughput in bytes per second.

**Vanilla CompactBuffer, no changes**

```
ScalaSparkPagerank 2016-12-01 13:16:09 259928115 47.933 5422738
ScalaSparkPagerank 2016-12-01 13:22:41 259928115 45.551 5706309
ScalaSparkPagerank 2016-12-01 13:26:31 259928115 46.745 5560554
ScalaSparkPagerank 2016-12-01 13:28:58 259928115 51.699 5027720
ScalaSparkPagerank 2016-12-01 13:33:26 259928115 48.415 5368751

240.343s / 5 = 48.068s avg
```

**The commit here**

```
ScalaSparkPagerank 2016-12-01 10:26:12 259928115 48.706 5336675
ScalaSparkPagerank 2016-12-01 10:37:30 259928115 48.947 5310399
ScalaSparkPagerank 2016-12-01 10:40:16 259928115 49.768 5222796
ScalaSparkPagerank 2016-12-01 12:55:37 259928115 48.873 5318439
ScalaSparkPagerank 2016-12-01 12:58:12 259928115 47.535 5468141

243.829s / 5 = 48.7658s avg
```

These two sets are too similar to separate, so I'm attributing the difference to benchmark noise, although without the 51s run in the vanilla results this commit would be a few percentage points worse.

**Use an ArrayBuffer (initial capacity of 16, default) instead of CompactBuffer**

```
ScalaSparkPagerank 2016-12-01 13:42:45 259928115 62.190 4179580
ScalaSparkPagerank 2016-12-01 13:55:20 259928115 54.112 4803520
ScalaSparkPagerank 2016-12-01 13:59:06 259928115 60.818 4273868
ScalaSparkPagerank 2016-12-01 14:06:26 259928115 57.428 4526156
ScalaSparkPagerank 2016-12-01 14:35:01 259928115 58.218 4464737

292.766s / 5 = 58.5532s avg
```

**Use an ArrayBuffer (initial capacity of 2) instead of CompactBuffer**

```
ScalaSparkPagerank 2016-12-01 15:31:16 259928115 53.544 4854476
ScalaSparkPagerank 2016-12-01 15:36:32 259928115 58.105 4473420
ScalaSparkPagerank 2016-12-01 15:38:45 259928115 53.976 4815623
ScalaSparkPagerank 2016-12-01 15:44:09 259928115 55.174 4711061
ScalaSparkPagerank 2016-12-01 15:50:01 259928115 55.084 4718758

275.883s / 5 = 55.1766s avg
```

In my tests, using an ArrayBuffer is noticeably worse, so I'll continue to look into what's going on to see if we can improve performance here, as this is definitely a hot codepath for this particular algorithm.
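
The averages and relative differences quoted above can be re-checked with a short script. This is a minimal sketch, not part of the benchmark tooling: the dictionary labels, the `slowdown_pct` helper, and the 51-second cutoff used to drop the slow vanilla run are illustrative assumptions, and the run times are copied from the results in this comment.

```python
from statistics import mean

# Run times in seconds, copied from the benchmark output in this comment.
# The labels are illustrative names, not identifiers from the benchmark.
runs = {
    "CompactBuffer (vanilla)": [47.933, 45.551, 46.745, 51.699, 48.415],
    "This commit": [48.706, 48.947, 49.768, 48.873, 47.535],
    "ArrayBuffer (capacity 16)": [62.190, 54.112, 60.818, 57.428, 58.218],
    "ArrayBuffer (capacity 2)": [53.544, 58.105, 53.976, 55.174, 55.084],
}


def slowdown_pct(candidate, baseline):
    """Percentage by which the candidate's mean run time exceeds the baseline's."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0


baseline = runs["CompactBuffer (vanilla)"]

# Mean run time per configuration and its difference from the vanilla mean.
for label, times in runs.items():
    print(f"{label}: avg {mean(times):.3f}s, "
          f"{slowdown_pct(times, baseline):+.1f}% vs vanilla")

# Assumption: treat the 51.699s vanilla run as the outlier mentioned above
# and recompute the commit's slowdown against the remaining four runs.
trimmed_baseline = [t for t in baseline if t < 51.0]
commit_vs_trimmed = slowdown_pct(runs["This commit"], trimmed_baseline)
print(f"This commit vs vanilla without the 51s run: {commit_vs_trimmed:+.1f}%")
```

On these numbers the two ArrayBuffer variants come out roughly 22% and 15% slower than the vanilla mean, while the commit is within a few percent either way, which matches the reading given above. With only five runs per configuration these are rough indications rather than statistically solid figures.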