[GitHub] [spark] attilapiros edited a comment on pull request #28967: [WIP][SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service

GitBox Wed, 01 Jul 2020 08:27:25 -0700


attilapiros edited a comment on pull request #28967:
URL: https://github.com/apache/spark/pull/28967#issuecomment-652485109



   # Regarding the benchmark
   
   The code is in a separate commit in which both solutions are tested. The benchmark is not intended to be reused; it only serves to show that this one-time change is well-founded and justified.
   
   The commit is on another [branch](https://github.com/attilapiros/spark/tree/SPARK-32149-benchmark), which has the same base as the PR. The commit with the benchmark is [here](https://github.com/attilapiros/spark/commit/b66ffac5c8e2678d7cbb2294cb23a14e1e1e28ad).
   
   The code is:
   ```scala
   import java.io.File

   import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}
   import org.apache.spark.network.shuffle.ExecutorDiskUtils

   /**
    * Benchmark for NormalizedInternedPathname.
    * To run this benchmark:
    * {{{
    *   1. without sbt:
    *      bin/spark-submit --class <this class> --jars <spark core test jar>
    *   2. build/sbt "core/test:runMain <this class>"
    *   3. generate result:
    *      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain <this class>"
    *      Results will be written to "benchmarks/NormalizedInternedPathname-results.txt".
    * }}}
    * */
   object NormalizedInternedPathnameBenchmark extends BenchmarkBase {
     val seed = 0x1337
   
   
     private def normalizePathnames(numIters: Int, newBefore: Boolean): Unit = {
       val numLocalDir = 100
       val numSubDir = 100
       val numFilenames = 100
       val sumPathNames = numLocalDir * numSubDir * numFilenames
       val benchmark =
         new Benchmark(s"Normalize pathnames newBefore=$newBefore", sumPathNames, output = output)
       val localDir = s"/a//b//c/d/e//f/g//$newBefore"
       val files = (1 to numLocalDir).flatMap { localDirId =>
         (1 to numSubDir).flatMap { subDirId =>
           (1 to numFilenames).map { filenameId =>
             (localDir + localDirId, subDirId.toString, s"filename_$filenameId")
           }
         }
       }
       val namedNewMethod = "new" -> normalizeNewMethod
       val namedOldMethod = "old" -> normalizeOldMethod
   
       val ((firstName, firstMethod), (secondName, secondMethod)) =
         if (newBefore) (namedNewMethod, namedOldMethod) else (namedOldMethod, namedNewMethod)
   
   
       benchmark.addCase(
         s"Normalize with the $firstName method", numIters) { _ =>
           firstMethod(files)
       }
       benchmark.addCase(
         s"Normalize with the $secondName method", numIters) { _ =>
           secondMethod(files)
       }
       benchmark.run()
     }
   
   
     private val normalizeOldMethod = (files: Seq[(String, String, String)]) => {
       files.map { case (localDir, subDir, filename) =>
         ExecutorDiskUtils.createNormalizedInternedPathname(localDir, subDir, filename)
       }.size
     }
   
   
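     private val normalizeNewMethod = (files: Seq[(String, String, String)]) => {
       files.map { case (localDir, subDir, filename) =>
         // Interpolate the tuple fields: without the `$` prefixes every call would
         // build the same literal string "localDir/subDir/filename".
         new File(s"$localDir${File.separator}$subDir${File.separator}$filename").getPath().intern()
       }.size
     }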
   
   
     override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
       val numIters = 25
       runBenchmark("Normalize pathnames new method first") {
         normalizePathnames(numIters, newBefore = true)
       }
       runBenchmark("Normalize pathnames old method first") {
         normalizePathnames(numIters, newBefore = false)
       }
     }
   }
   ```
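
   As a minimal sketch of what the `new File(...).getPath()` approach relies on: `java.io.File` normalizes the pathname it is constructed with, so on a Unix-like file system repeated separators such as the `//` in the benchmark's `localDir` are collapsed and a trailing separator is dropped. The object and method names below are hypothetical and only illustrate the idea; this is not the PR code.

   ```scala
   import java.io.File

   object PathNormalizationSketch {
     // Joins the three components with the platform separator, lets
     // java.io.File normalize the result, and interns the normalized string.
     def normalize(localDir: String, subDir: String, filename: String): String =
       new File(s"$localDir${File.separator}$subDir${File.separator}$filename").getPath.intern()
   }
   ```

   On Linux or macOS, `PathNormalizationSketch.normalize("/a//b//c/", "1", "f")` should yield `/a/b/c/1/f`; the result on Windows differs because the separator and normalization rules differ.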
   
   So it runs the new and the old method on 1,000,000 paths for 25 iterations, then does the same on 1,000,000 other paths, this time with the old method first and the new one second. Both methods are tested in both orders (once first, once second) on the assumption that string interning might cost differently when a string is interned for the first time than when a match is already found on a later call.
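
   The ordering concern can be illustrated with the semantics of `String.intern()`: two equal strings built at runtime are distinct objects, the first `intern()` call is expected to add the string to the pool, and a later call on an equal string returns the pooled instance. The sketch below (hypothetical names and an illustrative path) shows only these semantics, not the timing difference the benchmark tries to account for.

   ```scala
   object InternOrderSketch {
     // Concatenates at runtime, so the result is a fresh heap object rather
     // than a compile-time constant that would already be in the string pool.
     def build(dir: String, name: String): String = dir + "/" + name

     // Returns (distinct objects before interning, same object after interning).
     def demo(): (Boolean, Boolean) = {
       val a = build("/tmp/blockmgr", "shuffle_0_0_0.data")
       val b = build("/tmp/blockmgr", "shuffle_0_0_0.data")
       val distinctBefore = !(a eq b)
       val first = a.intern()   // first call: no pooled match is expected yet
       val second = b.intern()  // later call: the pooled string is found and returned
       (distinctBefore, first eq second)
     }
   }
   ```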
   
   # The benchmark result 
   
   ```
   
   ================================================================================================
   Normalize pathnames new method first
   ================================================================================================

   OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   Normalize pathnames newBefore=true:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Normalize with the new method                       252            259          10          4.0         252.2       1.0X
   Normalize with the old method                      1727           2018         162          0.6        1726.6       0.1X


   ================================================================================================
   Normalize pathnames old method first
   ================================================================================================

   OpenJDK 64-Bit Server VM 1.8.0_242-b08 on Mac OS X 10.15.5
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   Normalize pathnames newBefore=false:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Normalize with the old method                      1812           2065         153          0.6        1812.3       1.0X
   Normalize with the new method                       252            254           2          4.0         252.0       7.2X
   
   ```
   
   So the new method is roughly 7 to 8 times faster than the old one (best time 252 ms versus 1727 to 1812 ms per million paths).
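
   The speedup can be checked directly from the best times in the two tables. The sketch below only redoes that arithmetic; the object and value names are hypothetical, and the numbers are copied from the results above.

   ```scala
   object SpeedupSketch {
     // Best times in milliseconds, copied from the two result tables.
     val newBestMs = 252.0
     val oldBestMsNewFirst = 1727.0
     val oldBestMsOldFirst = 1812.0

     // How many times faster the new method is than the old one.
     def speedup(oldMs: Double, newMs: Double): Double = oldMs / newMs

     def main(args: Array[String]): Unit = {
       println(s"new method first: ${speedup(oldBestMsNewFirst, newBestMs)}")
       println(s"old method first: ${speedup(oldBestMsOldFirst, newBestMs)}")
     }
   }
   ```

   Both best-time ratios come out close to 7 (about 6.9 and 7.2); the second matches the `7.2X` in the Relative column of the second table.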

