Here are some settings we use for some very large GraphX jobs. These are based on using EC2 c3.8xlarge workers:

.set("spark.sql.shuffle.partitions", "1024")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.executor.memory", "24g")
.set("spark.kryoserializer.buffer.max", "1g")
.set("spark.sql.codegen.wholeStage", "true")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "25769803776") // 24 GB

Some of these are in fact default configurations. Some are not.
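In case it's useful, here is a minimal sketch of how these settings fit together in a SparkConf (the app name is a placeholder, not from our real job; the settings themselves are verbatim from the list above):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("graphx-job") // placeholder name
  .set("spark.sql.shuffle.partitions", "1024")
  .set("spark.sql.tungsten.enabled", "true")
  .set("spark.executor.memory", "24g")
  .set("spark.kryoserializer.buffer.max", "1g")
  .set("spark.sql.codegen.wholeStage", "true")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "25769803776") // 24 GB of off-heap memory

val spark = SparkSession.builder.config(conf).getOrCreate()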
.set("spark.sql.shuffle.partitions", "1024") .set("spark.sql.tungsten.enabled", "true") .set("spark.executor.memory", "24g") .set("spark.kryoserializer.buffer.max","1g") .set("spark.sql.codegen.wholeStage", "true") .set("spark.memory.offHeap.enabled", "true") .set("spark.memory.offHeap.size", "25769803776") // 24 GB Some of these are in fact default configurations. Some are not. Michael > On Jul 8, 2016, at 9:01 AM, Michael Allman <mich...@videoamp.com> wrote: > > Hi Adam, > > From our experience we've found the default Spark 2.0 configuration to be > highly suboptimal. I don't know if this affects your benchmarks, but I would > consider running some tests with tuned and alternate configurations. > > Michael > > >> On Jul 8, 2016, at 8:58 AM, Adam Roberts <arobe...@uk.ibm.com >> <mailto:arobe...@uk.ibm.com>> wrote: >> >> Hi Michael, the two Spark configuration files aren't very exciting >> >> spark-env.sh >> Same as the template apart from a JAVA_HOME setting >> >> spark-defaults.conf >> spark.io.compression.codec lzf >> >> config.py has the Spark home set, is running Spark standalone mode, we run >> and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory >> fraction, 100 trials >> >> We can post the 1.6.2 comparison early next week, running lots of iterations >> over the weekend once we get the dedicated time again >> >> Cheers, >> >> >> >> >> >> From: Michael Allman <mich...@videoamp.com >> <mailto:mich...@videoamp.com>> >> To: Adam Roberts/UK/IBM@IBMGB >> Cc: dev <dev@spark.apache.org <mailto:dev@spark.apache.org>> >> Date: 08/07/2016 16:44 >> Subject: Re: Spark 2.0.0 performance; potential large Spark core >> regression >> >> >> >> Hi Adam, >> >> Do you have your spark confs and your spark-env.sh somewhere where we can >> see them? If not, can you make them available? >> >> Cheers, >> >> Michael >> >> On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com >> <mailto:arobe...@uk.ibm.com>> wrote: >> >> Hi, we've been testing the performance of Spark 2.0 compared to previous >> releases, unfortunately there are no Spark 2.0 compatible versions of >> HiBench and SparkPerf apart from those I'm working on (see >> https://github.com/databricks/spark-perf/issues/108 >> <https://github.com/databricks/spark-perf/issues/108>) >> >> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean >> regression with a very small scale factor and so we've generated a couple of >> profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We >> will gather a 1.6.2 comparison and increase the scale factor. >> >> Has anybody noticed a similar problem? My changes for SparkPerf and Spark >> 2.0 are very limited and AFAIK don't interfere with Spark core >> functionality, so any feedback on the changes would be much appreciated and >> welcome, I'd much prefer it if my changes are the problem. >> >> A summary for your convenience follows (this matches what I've mentioned on >> the SparkPerf issue above) >> >> 1. spark-perf/config/config.py : SCALE_FACTOR=0.05 >> No. Of Workers: 1 >> Executor per Worker : 1 >> Executor Memory: 18G >> Driver Memory : 8G >> Serializer: kryo >> >> 2. 
>>
>> In Spark 2.0, the top five methods where we spend our time are as follows;
>> the percentage is how much of the overall processing time was spent in that
>> particular method:
>> 1. AppendOnlyMap.changeValue: 44%
>> 2. SortShuffleWriter.write: 19%
>> 3. SizeTracker.estimateSize: 7.5%
>> 4. SizeEstimator.estimate: 5.36%
>> 5. Range.foreach: 3.6%
>>
>> In 1.5.2, the top five methods are:
>> 1. AppendOnlyMap.changeValue: 38%
>> 2. ExternalSorter.insertAll: 33%
>> 3. Range.foreach: 4%
>> 4. SizeEstimator.estimate: 2%
>> 5. SizeEstimator.visitSingleObject: 2%
>>
>> I see the following scores; on the left is the test name, followed by the
>> 1.5.2 time and then the 2.0.0 time:
>> scheduling throughput: 5.2s vs 7.08s
>> agg by key: 0.72s vs 1.01s
>> agg by key int: 0.93s vs 1.19s
>> agg by key naive: 1.88s vs 2.02s
>> sort by key: 0.64s vs 0.8s
>> sort by key int: 0.59s vs 0.64s
>> scala count: 0.09s vs 0.08s
>> scala count w fltr: 0.31s vs 0.47s
>>
>> This is only running the Spark core tests (scheduling throughput through
>> scala count w fltr, including everything in between).
>>
>> Cheers,
>>
>> Unless stated otherwise above:
>> IBM United Kingdom Limited - Registered in England and Wales with number
>> 741598.
>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU