[GitHub] [spark] ben-manes commented on a change in pull request #31517: [WIP][SPARK-34309][BUILD][CORE][SQL][K8S]Use Caffeine instead of Guava Cache

GitBox Mon, 29 Mar 2021 13:14:06 -0700


ben-manes commented on a change in pull request #31517:
URL: https://github.com/apache/spark/pull/31517#discussion_r603579741




##########
File path: core/src/test/scala/org/apache/spark/LocalCacheBenchmark.scala
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.util.Random
+
+import com.github.benmanes.caffeine.cache.{CacheLoader => CaffeineCacheLoader, Caffeine}
+import com.github.benmanes.caffeine.guava.CaffeinatedGuava
+import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
+
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+
+/**
+ * Benchmark for Guava Cache vs Caffeine.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class> --jars <spark core test jar>
+ *   2. build/sbt "core/test:runMain <this class>"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
+ *      Results will be written to "benchmarks/KryoBenchmark-results.txt".
+ * }}}
+ */
+object LocalCacheBenchmark extends BenchmarkBase {
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    runBenchmark("Loading Cache") {
+      val size = 10000
+      val parallelism = 8
+      val guavaCacheConcurrencyLevel = 8
+      val dataset = (1 to parallelism)
+        .map(_ => Random.shuffle(List.range(0, size)))
+        .map(list => list.map(i => TestData(i)))

Review comment:
       This access distribution is uniform: each thread's list contains every key exactly once, in shuffled order. That means there are no hot and cold entries, so even random eviction has an optimal hit rate. In reality, some entries are used much more often than others, and access frequencies follow a power-law curve.
   
   That is a fairly generous distribution for a cache like Guava, which uses coarse-grained locking over multiple hash tables. With uniform access, the access distribution matches the hash distribution, so the load is ideally spread across all of the locks. In reality, the hash distribution is uniform but the access distribution is not, so a lock guarding hot entries is used much more frequently than the others.
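
A rough sketch of that effect, under stated assumptions: `StripeLoadSketch`, `stripeOf`, and `stripeShares` are hypothetical names, the stripe function is only a stand-in for hash-based lock selection (it is not Guava's actual segment function), and the Zipf exponent of 1 is an illustrative choice. It compares the share of accesses that lands on the busiest stripe under uniform and skewed key weights.

```scala
object StripeLoadSketch {
  // Assumed parameters mirroring the benchmark: 10000 keys, 8 lock stripes.
  val size = 10000
  val stripes = 8

  // Stand-in for hash-based stripe selection (NOT Guava's real segment function).
  def stripeOf(key: Int): Int =
    Math.floorMod(scala.util.hashing.byteswap32(key), stripes)

  // Fraction of all accesses handled by each stripe, given per-key access weights.
  def stripeShares(weights: IndexedSeq[Double]): IndexedSeq[Double] = {
    val total = weights.sum
    val shares = Array.fill(stripes)(0.0)
    weights.indices.foreach(k => shares(stripeOf(k)) += weights(k) / total)
    shares.toIndexedSeq
  }

  def main(args: Array[String]): Unit = {
    // Uniform access: every key is equally likely.
    val uniform = IndexedSeq.fill(size)(1.0)
    // Skewed access: the key of rank r is accessed proportionally to 1 / r.
    val zipf = IndexedSeq.tabulate(size)(k => 1.0 / (k + 1))
    println(f"uniform: busiest stripe share = ${stripeShares(uniform).max}%.3f")
    println(f"zipf:    busiest stripe share = ${stripeShares(zipf).max}%.3f")
  }
}
```

With uniform weights every stripe should receive roughly the same share, while with the skewed weights the stripe that happens to hold the hottest keys takes a noticeably larger one.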
   
   Caffeine's [benchmarks](https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java) use a scrambled Zipfian distribution (YCSB's generator). That would show an even larger speedup.
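
For illustration only, a skewed dataset for this benchmark could be generated along these lines. This is a minimal sketch and not YCSB's generator: it uses a plain inverse-CDF Zipf sampler plus a random permutation to scatter the hot keys. `ZipfDatasetSketch`, `ZipfSampler`, `zipfDataset`, the exponent 0.99, and the seed are all assumptions made for the example, and the keys are left as plain `Int`s rather than wrapped in `TestData`.

```scala
import scala.util.Random

object ZipfDatasetSketch {

  /** Samples keys in [0, size) where the key of rank r has weight 1 / r^exponent. */
  final class ZipfSampler(size: Int, exponent: Double, rnd: Random) {
    // Cumulative distribution over ranks, normalized so the last entry is about 1.0.
    private val cdf: Array[Double] = {
      val weights = Array.tabulate(size)(r => 1.0 / math.pow(r + 1.0, exponent))
      val total = weights.sum
      weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
    }
    // Fixed random permutation so that hot ranks are scattered over the key space.
    private val scramble: Array[Int] = rnd.shuffle((0 until size).toVector).toArray

    def next(): Int = {
      val u = rnd.nextDouble()
      // Binary search for the first rank whose cumulative probability is >= u.
      var lo = 0
      var hi = size - 1
      while (lo < hi) {
        val mid = (lo + hi) >>> 1
        if (cdf(mid) < u) lo = mid + 1 else hi = mid
      }
      scramble(lo)
    }
  }

  /** One list of `size` skewed key accesses per benchmark thread. */
  def zipfDataset(parallelism: Int, size: Int, exponent: Double, seed: Long): Seq[List[Int]] = {
    val sampler = new ZipfSampler(size, exponent, new Random(seed))
    (1 to parallelism).map(_ => List.fill(size)(sampler.next()))
  }

  def main(args: Array[String]): Unit = {
    val dataset = zipfDataset(parallelism = 8, size = 10000, exponent = 0.99, seed = 42L)
    val counts = dataset.flatten.groupBy(identity).map { case (k, v) => k -> v.size }
    val (hotKey, hotCount) = counts.maxBy(_._2)
    println(s"distinct keys: ${counts.size}, hottest key $hotKey accessed $hotCount times")
  }
}
```

Unlike the shuffled lists, a dataset built this way repeats a few keys many times and never touches some keys at all, so it exercises both lock contention on hot entries and the eviction policy.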
   
   This is mostly an FYI that your benchmarks are conservative and you may see a larger gain. Of course, if the caches are not a bottleneck, you might not see any benefit unless the eviction policy improves the hit rates in your workloads.



