Re: Spark 2.0.0 performance; potential large Spark core regression

Ted Yu Fri, 08 Jul 2016 09:27:17 -0700

bq. we turned it off when fixing a bug

Adam:
Can you point us to the JIRA for that bug?


Thanks

On Fri, Jul 8, 2016 at 9:22 AM, Adam Roberts wrote:

> Thanks Michael, we can give your options a try and aim for a comparison of
> 2.0.0 tuned vs 2.0.0 default vs 1.6.2 default. For future reference, the
> defaults in Spark 2 RC2 look to be:
>
> sql.shuffle.partitions: 200
> Tungsten enabled: true
> Executor memory: 1 GB (we set to 18 GB)
> kryo buffer max: 64mb
> WholeStageCodegen: on, I think (we turned it off when fixing a bug)
> offHeap.enabled: false
> offHeap.size: 0
>
> Cheers,
>
>
>
>
> From:        Michael Allman
> To:        Adam Roberts/UK/IBM@IBMGB
> Cc:        dev
> Date:        08/07/2016 17:05
> Subject:        Re: Spark 2.0.0 performance; potential large Spark core
> regression
> ------------------------------
>
>
>
> Here are some settings we use for some very large GraphX jobs. These are
> based on using EC2 c3.8xl workers:
>
>     import org.apache.spark.SparkConf
>
>     val conf = new SparkConf()
>       .set("spark.sql.shuffle.partitions", "1024")
>       .set("spark.sql.tungsten.enabled", "true")
>       .set("spark.executor.memory", "24g")
>       .set("spark.kryoserializer.buffer.max", "1g")
>       .set("spark.sql.codegen.wholeStage", "true")
>       .set("spark.memory.offHeap.enabled", "true")
>       .set("spark.memory.offHeap.size", "25769803776") // 24 GiB, in bytes
>
> Some of these are in fact default configurations. Some are not.
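
For reference, a minimal sketch of the settings listed above as plain key/value pairs, rendered as `--conf` arguments for spark-submit. The object and member names (`TunedConf`, `gib`, `settings`, `asSubmitArgs`) are made up for illustration and do not come from the thread. The keys and values are copied from the message; whether a given Spark release honours each key is not checked here.

```scala
object TunedConf {
  // Convert GiB to bytes; the message gives spark.memory.offHeap.size in bytes.
  def gib(n: Long): Long = n * 1024L * 1024L * 1024L

  val offHeapBytes: Long = gib(24)

  // Keys and values as listed in the message above.
  val settings: Seq[(String, String)] = Seq(
    "spark.sql.shuffle.partitions"    -> "1024",
    "spark.sql.tungsten.enabled"      -> "true",
    "spark.executor.memory"           -> "24g",
    "spark.kryoserializer.buffer.max" -> "1g",
    "spark.sql.codegen.wholeStage"    -> "true",
    "spark.memory.offHeap.enabled"    -> "true",
    "spark.memory.offHeap.size"       -> offHeapBytes.toString
  )

  // Render each pair as two command-line tokens: --conf key=value
  def asSubmitArgs(kv: Seq[(String, String)]): Seq[String] =
    kv.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit =
    println(asSubmitArgs(settings).mkString(" "))
}
```

With Spark on the classpath, the same pairs could instead be applied in code with `new SparkConf().setAll(TunedConf.settings)`.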
>
> Michael
>
>
> On Jul 8, 2016, at 9:01 AM, Michael Allman wrote:
>
> Hi Adam,
>
> From our experience we've found the default Spark 2.0 configuration to be
> highly suboptimal. I don't know if this affects your benchmarks, but I
> would consider running some tests with tuned and alternate configurations.
>
> Michael
>
>
> On Jul 8, 2016, at 8:58 AM, Adam Roberts wrote:
>
> Hi Michael, the two Spark configuration files aren't very exciting:
>
> spark-env.sh
> Same as the template apart from a JAVA_HOME setting.
>
> spark-defaults.conf
> spark.io.compression.codec lzf
>
> config.py has the Spark home set and runs Spark in standalone mode. We run
> and prep the Spark tests only, with driver memory 8g, executor memory 16g,
> Kryo, a 0.66 memory fraction, and 100 trials.
>
> We can post the 1.6.2 comparison early next week; we will run lots of
> iterations over the weekend once we get the dedicated time again.
>
> Cheers,
>
>
>
>
>
> From:        Michael Allman
> To:        Adam Roberts/UK/IBM@IBMGB
> Cc:        dev
> Date:        08/07/2016 16:44
> Subject:        Re: Spark 2.0.0 performance; potential large Spark core
> regression
> ------------------------------
>
>
>
> Hi Adam,
>
> Do you have your spark confs and your spark-env.sh somewhere where we can
> see them? If not, can you make them available?
>
> Cheers,
>
> Michael
>
> On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote:
>
> Hi, we've been testing the performance of Spark 2.0 compared to previous
> releases. Unfortunately, there are no Spark 2.0 compatible versions of
> HiBench and SparkPerf apart from those I'm working on (see
> https://github.com/databricks/spark-perf/issues/108).
>
> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean
> regression with a very small scale factor and so we've generated a couple
> of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform.
> We will gather a 1.6.2 comparison and increase the scale factor.
>
> Has anybody noticed a similar problem? My changes for SparkPerf and Spark
> 2.0 are very limited and AFAIK don't interfere with Spark core
> functionality, so any feedback on the changes would be much appreciated.
> I'd much prefer it if my changes are the problem.
>
> A summary for your convenience follows (this matches what I've mentioned
> on the SparkPerf issue above)
>
> 1. spark-perf/config/config.py: SCALE_FACTOR=0.05
> Number of workers: 1
> Executors per worker: 1
> Executor memory: 18G
> Driver memory: 8G
> Serializer: kryo
>
> 2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options:
> -Xdisableexplicitgc -Xcompressedrefs
>
> Main changes I made for the benchmark itself:
>
>    - Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
>    - MLAlgorithmTests use Vectors.fromML
>    - For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD,
>    not wordStream.foreach
>    - KVDataTest uses awaitTerminationOrTimeout on a StreamingContext
>    instead of awaitTermination
>    - Trivial: we use compact, not compact.render, for outputting JSON
>
>
> In Spark 2.0 the top five methods where we spend our time are as follows;
> the percentage is how much of the overall processing time was spent in
> that particular method:
> 1.        AppendOnlyMap.changeValue 44%
> 2.        SortShuffleWriter.write 19%
> 3.        SizeTracker.estimateSize 7.5%
> 4.        SizeEstimator.estimate 5.36%
> 5.        Range.foreach 3.6%
>
> and in 1.5.2 the top five methods are:
> 1.        AppendOnlyMap.changeValue 38%
> 2.        ExternalSorter.insertAll 33%
> 3.        Range.foreach 4%
> 4.        SizeEstimator.estimate 2%
> 5.        SizeEstimator.visitSingleObject 2%
>
> I see the following scores. Each line gives the test name followed by the
> 1.5.2 time and then the 2.0.0 time:
> scheduling throughput: 5.2s vs 7.08s
> agg by key: 0.72s vs 1.01s
> agg by key int: 0.93s vs 1.19s
> agg by key naive: 1.88s vs 2.02s
> sort by key: 0.64s vs 0.8s
> sort by key int: 0.59s vs 0.64s
> scala count: 0.09s vs 0.08s
> scala count w fltr: 0.31s vs 0.47s
>
> This is only running the Spark core tests (scheduling throughput through
> scala-count-w-fltr, including all in between).
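
The per-test slowdown, and a geometric mean over just the eight timings listed above, can be worked out as in the sketch below. The names (`CoreTestSlowdown`, `timings`, `ratios`, `geomean`) are hypothetical and are not part of SparkPerf. The sketch only covers the numbers shown in this message, so it is not meant to reproduce the 30% figure quoted earlier.

```scala
object CoreTestSlowdown {
  // (test name, 1.5.2 time in seconds, 2.0.0 time in seconds), copied from the message.
  val timings: Seq[(String, Double, Double)] = Seq(
    ("scheduling throughput", 5.2,  7.08),
    ("agg by key",            0.72, 1.01),
    ("agg by key int",        0.93, 1.19),
    ("agg by key naive",      1.88, 2.02),
    ("sort by key",           0.64, 0.8),
    ("sort by key int",       0.59, 0.64),
    ("scala count",           0.09, 0.08),
    ("scala count w fltr",    0.31, 0.47)
  )

  // Ratio of 2.0.0 time to 1.5.2 time; a value above 1 means 2.0.0 was slower.
  val ratios: Seq[(String, Double)] =
    timings.map { case (name, t152, t200) => name -> t200 / t152 }

  // Geometric mean computed as exp(mean(log(x))).
  def geomean(xs: Seq[Double]): Double =
    math.exp(xs.map(x => math.log(x)).sum / xs.size)

  def main(args: Array[String]): Unit = {
    ratios.foreach { case (name, r) => println(f"$name%-22s $r%.2fx") }
    println(f"geomean slowdown: ${geomean(ratios.map(_._2))}%.3fx")
  }
}
```

A geometric mean is used here, rather than an arithmetic mean, because the values being averaged are ratios.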
>
> Cheers,
>
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
