Re: Spark 2.0.0 performance; potential large Spark core regression

Michael Allman Fri, 08 Jul 2016 09:06:18 -0700

Here are some settings we use for some very large GraphX jobs. These are based 
on using EC2 c3.8xl workers:


    .set("spark.sql.shuffle.partitions", "1024")
    .set("spark.sql.tungsten.enabled", "true")
    .set("spark.executor.memory", "24g")
    .set("spark.kryoserializer.buffer.max","1g")
    .set("spark.sql.codegen.wholeStage", "true")
    .set("spark.memory.offHeap.enabled", "true")
    .set("spark.memory.offHeap.size", "25769803776") // 24 GB

Some of these are in fact default configurations. Some are not.

Michael


> On Jul 8, 2016, at 9:01 AM, Michael Allman <mich...@videoamp.com> wrote:
> 
> Hi Adam,
> 
> From our experience we've found the default Spark 2.0 configuration to be 
> highly suboptimal. I don't know if this affects your benchmarks, but I would 
> consider running some tests with tuned and alternate configurations.
> 
> Michael
> 
> 
>> On Jul 8, 2016, at 8:58 AM, Adam Roberts <arobe...@uk.ibm.com 
>> <mailto:arobe...@uk.ibm.com>> wrote:
>> 
>> Hi Michael, the two Spark configuration files aren't very exciting 
>> 
>> spark-env.sh 
>> Same as the template apart from a JAVA_HOME setting 
>> 
>> spark-defaults.conf 
>> spark.io.compression.codec lzf 
>> 
>> config.py has the Spark home set, is running Spark standalone mode, we run 
>> and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory 
>> fraction, 100 trials 
>> 
>> We can post the 1.6.2 comparison early next week, running lots of iterations 
>> over the weekend once we get the dedicated time again 
>> 
>> Cheers, 
>> 
>> 
>> 
>> 
>> 
>> From:        Michael Allman <mich...@videoamp.com 
>> <mailto:mich...@videoamp.com>> 
>> To:        Adam Roberts/UK/IBM@IBMGB 
>> Cc:        dev <dev@spark.apache.org <mailto:dev@spark.apache.org>> 
>> Date:        08/07/2016 16:44 
>> Subject:        Re: Spark 2.0.0 performance; potential large Spark core 
>> regression 
>> 
>> 
>> 
>> Hi Adam, 
>> 
>> Do you have your spark confs and your spark-env.sh somewhere where we can 
>> see them? If not, can you make them available? 
>> 
>> Cheers, 
>> 
>> Michael 
>> 
>> On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com 
>> <mailto:arobe...@uk.ibm.com>> wrote: 
>> 
>> Hi, we've been testing the performance of Spark 2.0 compared to previous 
>> releases, unfortunately there are no Spark 2.0 compatible versions of 
>> HiBench and SparkPerf apart from those I'm working on (see 
>> https://github.com/databricks/spark-perf/issues/108 
>> <https://github.com/databricks/spark-perf/issues/108>) 
>> 
>> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean 
>> regression with a very small scale factor and so we've generated a couple of 
>> profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We 
>> will gather a 1.6.2 comparison and increase the scale factor. 
>> 
>> Has anybody noticed a similar problem? My changes for SparkPerf and Spark 
>> 2.0 are very limited and AFAIK don't interfere with Spark core 
>> functionality, so any feedback on the changes would be much appreciated and 
>> welcome, I'd much prefer it if my changes are the problem. 
>> 
>> A summary for your convenience follows (this matches what I've mentioned on 
>> the SparkPerf issue above) 
>> 
>> 1. spark-perf/config/config.py : SCALE_FACTOR=0.05
>> No. Of Workers: 1
>> Executor per Worker : 1
>> Executor Memory: 18G
>> Driver Memory : 8G
>> Serializer: kryo 
>> 
>> 2. $SPARK_HOME/conf/spark-defaults.conf: executor Java Options: 
>> -Xdisableexplicitgc -Xcompressedrefs 
>> 
>> Main changes I made for the benchmark itself
>> Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
>> MLAlgorithmTests use Vectors.fromML
>> For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not 
>> wordStream.foreach
>> KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead 
>> of awaitTermination
>> Trivial: we use compact not compact.render for outputting json
>> 
>> In Spark 2.0 the top five methods where we spend our time is as follows, the 
>> percentage is how much of the overall processing time was spent in this 
>> particular method: 
>> 1.        AppendOnlyMap.changeValue 44% 
>> 2.        SortShuffleWriter.write 19% 
>> 3.        SizeTracker.estimateSize 7.5% 
>> 4.        SizeEstimator.estimate 5.36% 
>> 5.        Range.foreach 3.6% 
>> 
>> and in 1.5.2 the top five methods are: 
>> 1.        AppendOnlyMap.changeValue 38% 
>> 2.        ExternalSorter.insertAll 33% 
>> 3.        Range.foreach 4% 
>> 4.        SizeEstimator.estimate 2% 
>> 5.        SizeEstimator.visitSingleObject 2% 
>> 
>> I see the following scores, on the left I have the test name followed by the 
>> 1.5.2 time and then the 2.0.0 time
>> scheduling throughput: 5.2s vs 7.08s
>> agg by key; 0.72s vs 1.01s
>> agg by key int: 0.93s vs 1.19s
>> agg by key naive: 1.88s vs 2.02
>> sort by key: 0.64s vs 0.8s
>> sort by key int: 0.59s vs 0.64s
>> scala count: 0.09s vs 0.08s
>> scala count w fltr: 0.31s vs 0.47s 
>> 
>> This is only running the Spark core tests (scheduling throughput through 
>> scala-count-w-filtr, including all inbetween). 
>> 
>> Cheers, 
>> 
>> 
>> Unless stated otherwise above:
>> IBM United Kingdom Limited - Registered in England and Wales with number 
>> 741598. 
>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 
>> 
>> 
>> Unless stated otherwise above:
>> IBM United Kingdom Limited - Registered in England and Wales with number 
>> 741598. 
>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>

Re: Spark 2.0.0 performance; potential large Spark core regression

Reply via email to