attilapiros edited a comment on pull request #28967: URL: https://github.com/apache/spark/pull/28967#issuecomment-652485109
# Regarding the benchmark The code is in a separate commit where both solution is tested. This benchmark is not intended to be reused just to prove this one time change is well-founded and justified. The commit is on another [branch](https://github.com/attilapiros/spark/tree/SPARK-32149-benchmark) which based on the same as the PR. And the commit with the benchmark is [here]( https://github.com/attilapiros/spark/commit/b66ffac5c8e2678d7cbb2294cb23a14e1e1e28ad). The code is: ```scala /** * Benchmark for NormalizedInternedPathname. * To run this benchmark: * {{{ * 1. without sbt: * bin/spark-submit --class <this class> --jars <spark core test jar> * 2. build/sbt "core/test:runMain <this class>" * 3. generate result: * SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>" * Results will be written to "benchmarks/NormalizedInternedPathname-results.txt". * }}} * */ object NormalizedInternedPathnameBenchmark extends BenchmarkBase { val seed = 0x1337 private def normalizePathnames(numIters: Int, newBefore: Boolean): Unit = { val numLocalDir = 100 val numSubDir = 100 val numFilenames = 100 val sumPathNames = numLocalDir * numSubDir * numFilenames val benchmark = new Benchmark(s"Normalize pathnames newBefore=$newBefore", sumPathNames, output = output) val localDir = s"/a//b//c/d/e//f/g//$newBefore" val files = (1 to numLocalDir).flatMap { localDirId => (1 to numSubDir).flatMap { subDirId => (1 to numFilenames).map { filenameId => (localDir + localDirId, subDirId.toString, s"filename_$filenameId") } } } val namedNewMethod = "new" -> normalizeNewMethod val namedOldMethod = "old" -> normalizeOldMethod val ((firstName, firstMethod), (secondName, secondMethod)) = if (newBefore) (namedNewMethod, namedOldMethod) else (namedOldMethod, namedNewMethod) benchmark.addCase( s"Normalize with the $firstName method", numIters) { _ => firstMethod(files) } benchmark.addCase( s"Normalize with the $secondName method", numIters) { _ => secondMethod(files) } benchmark.run() } private val normalizeOldMethod = (files: Seq[(String, String, String)]) => { files.map { case (localDir, subDir, filename) => ExecutorDiskUtils.createNormalizedInternedPathname(localDir, subDir, filename) }.size } private val normalizeNewMethod = (files: Seq[(String, String, String)]) => { files.map { case (localDir, subDir, filename) => new File(s"localDir${File.separator}subDir${File.separator}filename").getPath().intern() }.size } override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val numIters = 25 runBenchmark("Normalize pathnames new method first") { normalizePathnames(numIters, newBefore = true) } runBenchmark("Normalize pathnames old method first") { normalizePathnames(numIters, newBefore = false) } } } ``` So it runs the new and old method for a 1000000 paths for 25 iteration then do the same for 1000000 other paths but first the old method is used then the new one. The reason behind to test both method in both way (one is first the other is second) is the assumption that string interning might be different when it is first used on a string and when there is a match second time. # The benchmark result ``` ================================================================================================ Normalize pathnames new method first ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz Normalize pathnames newBefore=true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ Normalize with the new method 252 259 10 4.0 252.2 1.0X Normalize with the old method 1727 2018 162 0.6 1726.6 0.1X ================================================================================================ Normalize pathnames old method first ================================================================================================ OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5 Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz Normalize pathnames newBefore=false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ Normalize with the old method 1812 2065 153 0.6 1812.3 1.0X Normalize with the new method 252 254 2 4.0 252.0 7.2X ``` So the new method is about 7-10 times better than old one. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
