ben-manes commented on a change in pull request #31517:
URL: https://github.com/apache/spark/pull/31517#discussion_r603579741
##########
File path: core/src/test/scala/org/apache/spark/LocalCacheBenchmark.scala
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import scala.util.Random
+
+import com.github.benmanes.caffeine.cache.{CacheLoader => CaffeineCacheLoader, Caffeine}
+import com.github.benmanes.caffeine.guava.CaffeinatedGuava
+import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
+
+import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
+
+/**
+ * Benchmark for Guava Cache vs Caffeine.
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt:
+ *      bin/spark-submit --class <this class> --jars <spark core test jar>
+ *   2. build/sbt "core/test:runMain <this class>"
+ *   3. generate result:
+ *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
+ *      Results will be written to "benchmarks/KryoBenchmark-results.txt".
+ * }}}
+ */
+object LocalCacheBenchmark extends BenchmarkBase {
+
+  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
+    runBenchmark("Loading Cache") {
+      val size = 10000
+      val parallelism = 8
+      val guavaCacheConcurrencyLevel = 8
+      val dataset = (1 to parallelism)
+        .map(_ => Random.shuffle(List.range(0, size)))
+        .map(list => list.map(i => TestData(i)))

Review comment:
The keys here are uniformly distributed, and each key appears only once in each shuffled list. This means there are no hot or cold entries, so even random eviction achieves an optimal hit rate. In reality, some entries are used much more often than others, and access frequency follows a power-law curve.

A uniform distribution is fairly generous to a cache like Guava, which uses coarse-grained locking over multiple hash tables. With uniform access, the access distribution matches the hash distribution, so the load is ideally spread across all of the locks. In practice, the hash distribution is uniform but the access distribution is not, so a lock guarding hot entries is acquired much more frequently than the others.

Caffeine's [benchmarks](https://github.com/ben-manes/caffeine/blob/master/caffeine/src/jmh/java/com/github/benmanes/caffeine/cache/GetPutBenchmark.java) use a scrambled Zipfian distribution (YCSB's generator), which would likely show an even larger speedup. This is mostly an FYI that your benchmarks are conservative and you may see a larger gain. Of course, if the caches are not a bottleneck, you might not see any benefit unless the eviction policy improves the hit rates in your workloads.
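
To make the uniform-versus-skewed point concrete, here is a minimal sketch of how a power-law key stream could be generated for a benchmark like this one. It is not YCSB's generator and not part of Spark or Caffeine: `ZipfKeys`, its method names, and the `exponent` and `seed` parameters are hypothetical, and the scrambling is a plain random permutation chosen for simplicity rather than the hash-based scrambling YCSB uses.

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not part of Spark, Caffeine, or YCSB.
object ZipfKeys {

  /** Cumulative probabilities for ranks 1..n, where P(rank) is proportional to 1 / rank^exponent. */
  def cdf(n: Int, exponent: Double): Array[Double] = {
    require(n >= 1, "n must be positive")
    val weights = Array.tabulate(n)(i => 1.0 / math.pow(i + 1, exponent))
    val total = weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
    cumulative(n - 1) = 1.0 // guard against floating-point round-off in the last bucket
    cumulative
  }

  /** Draws one zero-based rank in [0, n) by binary search over the CDF. */
  def sampleRank(cumulative: Array[Double], rng: Random): Int = {
    val u = rng.nextDouble() // uniform in [0, 1)
    val idx = java.util.Arrays.binarySearch(cumulative, u)
    // A negative result encodes the insertion point as -(point) - 1.
    if (idx >= 0) idx else -idx - 1
  }

  /**
   * Generates `count` keys in [0, n) with Zipf-like popularity.
   * Ranks are mapped through a random permutation so that the hot keys
   * are scattered over the key space instead of clustering at 0, 1, 2, ...
   */
  def keys(n: Int, count: Int, exponent: Double, seed: Long): Vector[Int] = {
    val rng = new Random(seed)
    val cumulative = cdf(n, exponent)
    val permutation = rng.shuffle((0 until n).toVector)
    Vector.fill(count)(permutation(sampleRank(cumulative, rng)))
  }
}
```

Under these assumptions, each thread's shuffled list could be replaced by something like `ZipfKeys.keys(size, size, 0.99, seed).toList`, using a different seed per thread. The exponent 0.99 is only an illustrative skew; a larger value concentrates more of the traffic on a few keys, which is the case where a lock guarding hot entries becomes the bottleneck.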
