Here are some settings we use for very large GraphX jobs. These are based on
EC2 c3.8xlarge workers:
.set("spark.sql.shuffle.partitions", "1024")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.executor.memory", "24g")
.set("spark.kryoserializer.buffer.max","1g")
.set("spark.sql.codegen.wholeStage", "true")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "25769803776") // 24 GB
Some of these are in fact default configurations. Some are not.
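
For context, a minimal sketch of how these settings could be wired into a
SparkConf (the app name and master URL here are placeholders, not from our
actual jobs):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the .set(...) calls above applied to a SparkConf before the
// SparkContext is created.
val conf = new SparkConf()
  .setAppName("graphx-job")               // placeholder
  .setMaster("spark://master-host:7077")  // placeholder
  .set("spark.executor.memory", "24g")
  .set("spark.kryoserializer.buffer.max", "1g")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "25769803776") // 24 GB
  // ...plus the remaining .set(...) lines above
val sc = new SparkContext(conf)
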
Michael
> On Jul 8, 2016, at 9:01 AM, Michael Allman <[email protected]> wrote:
>
> Hi Adam,
>
> In our experience, we've found the default Spark 2.0 configuration to be
> highly suboptimal. I don't know if this affects your benchmarks, but I would
> consider running some tests with tuned and alternate configurations.
>
> Michael
>
>
>> On Jul 8, 2016, at 8:58 AM, Adam Roberts <[email protected]> wrote:
>>
>> Hi Michael, the two Spark configuration files aren't very exciting
>>
>> spark-env.sh
>> Same as the template apart from a JAVA_HOME setting
>>
>> spark-defaults.conf
>> spark.io.compression.codec lzf
>>
>> config.py has the Spark home set and runs Spark in standalone mode; we run
>> and prep the Spark tests only, with an 8g driver, 16g executor memory, Kryo
>> serialization, a 0.66 memory fraction, and 100 trials
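>>
>> A rough Scala sketch of those settings as standard Spark properties follows
>> (the key names are assumptions; in particular the 0.66 memory fraction is
>> read here as the legacy spark.storage.memoryFraction knob):
>>
>> import org.apache.spark.SparkConf
>>
>> // Sketch only: the settings described above expressed as SparkConf keys.
>> // spark.driver.memory normally has to be set in spark-defaults.conf or via
>> // --driver-memory before the driver JVM starts, not from application code.
>> val conf = new SparkConf()
>>   .set("spark.driver.memory", "8g")
>>   .set("spark.executor.memory", "16g")
>>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>>   .set("spark.storage.memoryFraction", "0.66")
>>   .set("spark.io.compression.codec", "lzf")  // from spark-defaults.conf above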
>>
>> We can post the 1.6.2 comparison early next week; we'll run lots of
>> iterations over the weekend once we get the dedicated time again
>>
>> Cheers,
>>
>> From: Michael Allman <[email protected]>
>> To: Adam Roberts/UK/IBM@IBMGB
>> Cc: dev <[email protected]>
>> Date: 08/07/2016 16:44
>> Subject: Re: Spark 2.0.0 performance; potential large Spark core
>> regression
>>
>> Hi Adam,
>>
>> Do you have your spark confs and your spark-env.sh somewhere where we can
>> see them? If not, can you make them available?
>>
>> Cheers,
>>
>> Michael
>>
>> On Jul 8, 2016, at 3:17 AM, Adam Roberts <[email protected]> wrote:
>>
>> Hi, we've been testing the performance of Spark 2.0 compared to previous
>> releases. Unfortunately, there are no Spark 2.0-compatible versions of
>> HiBench and SparkPerf apart from those I'm working on (see
>> https://github.com/databricks/spark-perf/issues/108)
>>
>> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean
>> regression with a very small scale factor, so we've generated a couple of
>> profiles comparing 1.5.2 vs 2.0.0 (same JDK version and same platform). We
>> will gather a 1.6.2 comparison and increase the scale factor.
>>
>> Has anybody noticed a similar problem? My changes for SparkPerf and Spark
>> 2.0 are very limited and AFAIK don't interfere with Spark core
>> functionality, so any feedback on the changes would be much appreciated and
>> welcome; I'd much prefer it if my changes were the problem.
>>
>> A summary for your convenience follows (this matches what I've mentioned on
>> the SparkPerf issue above):
>>
>> 1. spark-perf/config/config.py: SCALE_FACTOR=0.05
>>    Number of workers: 1
>>    Executors per worker: 1
>>    Executor memory: 18G
>>    Driver memory: 8G
>>    Serializer: kryo
>>
>> 2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options:
>>    -Xdisableexplicitgc -Xcompressedrefs
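>>
>> Assuming the standard property is used, that would be a spark-defaults.conf
>> line along these lines (spark.executor.extraJavaOptions is an assumption
>> about the key; the flags themselves are IBM J9 JVM options):
>>
>> spark.executor.extraJavaOptions -Xdisableexplicitgc -Xcompressedrefs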
>>
>> Main changes I made for the benchmark itself:
>> Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
>> MLAlgorithmTests use Vectors.fromML
>> For streaming-tests, HdfsRecoveryTest uses wordStream.foreachRDD instead of
>> wordStream.foreach (sketched below)
>> KVDataTest uses awaitTerminationOrTimeout on a StreamingContext instead of
>> awaitTermination (sketched below)
>> Trivial: we use compact rather than compact.render for outputting JSON
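>>
>> To make the streaming changes concrete, here's a small self-contained sketch
>> (the object name, input path, batch interval and timeout are placeholders,
>> not values from the actual tests):
>>
>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>
>> object StreamingChangesSketch {
>>   def main(args: Array[String]): Unit = {
>>     val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
>>     val ssc  = new StreamingContext(conf, Seconds(1))
>>     val wordStream = ssc.textFileStream("hdfs:///tmp/words")  // placeholder path
>>
>>     // foreachRDD replaces the DStream.foreach call that's gone in Spark 2.0
>>     wordStream.foreachRDD { rdd => println(s"batch count: ${rdd.count()}") }
>>
>>     ssc.start()
>>     // bounded wait instead of blocking indefinitely in awaitTermination()
>>     ssc.awaitTerminationOrTimeout(60 * 1000L)
>>     ssc.stop()
>>   }
>> }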
>>
>> In Spark 2.0, the top five methods where we spend our time are as follows;
>> the percentage is how much of the overall processing time was spent in that
>> particular method:
>> 1. AppendOnlyMap.changeValue 44%
>> 2. SortShuffleWriter.write 19%
>> 3. SizeTracker.estimateSize 7.5%
>> 4. SizeEstimator.estimate 5.36%
>> 5. Range.foreach 3.6%
>>
>> and in 1.5.2 the top five methods are:
>> 1. AppendOnlyMap.changeValue 38%
>> 2. ExternalSorter.insertAll 33%
>> 3. Range.foreach 4%
>> 4. SizeEstimator.estimate 2%
>> 5. SizeEstimator.visitSingleObject 2%
>>
>> I see the following scores; on the left is the test name, followed by the
>> 1.5.2 time and then the 2.0.0 time:
>> scheduling throughput: 5.2s vs 7.08s
>> agg by key: 0.72s vs 1.01s
>> agg by key int: 0.93s vs 1.19s
>> agg by key naive: 1.88s vs 2.02s
>> sort by key: 0.64s vs 0.8s
>> sort by key int: 0.59s vs 0.64s
>> scala count: 0.09s vs 0.08s
>> scala count w fltr: 0.31s vs 0.47s
>>
>> This is only running the Spark core tests (scheduling throughput through
>> scala count w fltr, including everything in between).
>>
>> Cheers,
>>
>>
>> Unless stated otherwise above:
>> IBM United Kingdom Limited - Registered in England and Wales with number
>> 741598.
>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>>
>