Re: [VOTE] Apache Spark 2.1.1 (RC3)
+1 (non-binding), looks good. Tested on RHEL 7.2, 7.3, CentOS 7.2, Ubuntu 14.04 and 16.04, SUSE 12, x86, IBM Linux on Power and IBM Linux on Z (big-endian). No problems with the latest IBM Java, Hadoop 2.7.3 and Scala 2.11.8, and no performance concerns to report either (spark-sql-perf and HiBench).

Built with: mvn -T 1C -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package

Tested with: mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test

From: Felix Cheung
To: Reynold Xin, Dong Joon Hyun, Marcelo Vanzin, Denny Lee
Cc: Michael Armbrust, "dev@spark.apache.org"
Date: 20/04/2017 08:13
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC3)

Tested on both Linux and Windows, as package. Found a StackOverflowError with ALS on Windows: https://issues.apache.org/jira/browse/SPARK-20402. This is part of the R CRAN check to build the vignettes. Very simple, quick and consistent repro on Windows; the exact same code works fine on Linux. It reproduces with 2.1.1 RC2 as well, but wasn't seen before because it was blocked by a different issue then, and the 2.1.0 release didn't have the ALS R API. Will convert the code to Scala to check and investigate further. I'm not sure if we would consider this a blocker, but it might block the R package release to CRAN.

From: Denny Lee
Sent: Wednesday, April 19, 2017 11:00 PM
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC3)
To: Reynold Xin, Dong Joon Hyun <dh...@hortonworks.com>, Marcelo Vanzin
Cc: Michael Armbrust

+1 (non-binding)

On Wed, Apr 19, 2017 at 9:23 PM Dong Joon Hyun wrote:
+1
I tested RC3 on CentOS 7.3.1611 / OpenJDK 1.8.0_121 / R 3.3.3 with `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Psparkr`. At the end of the R tests I saw `Had CRAN check errors; see logs.`, but the tests passed and the log file looks good.
Bests, Dongjoon.

From: Reynold Xin
Date: Wednesday, April 19, 2017 at 3:41 PM
To: Marcelo Vanzin
Cc: Michael Armbrust, "dev@spark.apache.org"
Subject: Re: [VOTE] Apache Spark 2.1.1 (RC3)

+1

On Wed, Apr 19, 2017 at 3:31 PM, Marcelo Vanzin wrote:
+1 (non-binding). Ran the hadoop-2.6 binary against our internal tests and things look good.

On Tue, Apr 18, 2017 at 11:59 AM, Michael Armbrust wrote:
> Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Fri, April 21st, 2017 at 13:00 PST and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.1-rc3 (2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>
> List of JIRA tickets resolved can be found with this filter.
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1230/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>
> FAQ
>
> How can I help test this release?
>
> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
>
> What should happen to JIRA tickets still targeting 2.1.1?
>
> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>
> But my bug isn't fixed!??!
>
> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.0.
>
> What happened to RC1?
>
> There were issues with the release packaging and as a result it was skipped.

-- Marcelo
Re: Spark performance tests
Hi, I suggest HiBench and spark-sql-perf. HiBench features many benchmarks that exercise several components of Spark (great for stressing core, SQL and MLlib capabilities), while spark-sql-perf features 99 TPC-DS queries (stressing the DataFrame API and therefore the Catalyst optimiser). Both work well with Spark 2.

HiBench: https://github.com/intel-hadoop/HiBench
spark-sql-perf: https://github.com/databricks/spark-sql-perf

From: "Kazuaki Ishizaki"
To: Prasun Ratn
Cc: Apache Spark Dev
Date: 10/01/2017 09:22
Subject: Re: Spark performance tests

Hi, you may find several micro-benchmarks under https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark.
Regards, Kazuaki Ishizaki

From: Prasun Ratn
To: Apache Spark Dev
Date: 2017/01/10 12:52
Subject: Spark performance tests

Hi, are there performance tests or microbenchmarks for Spark - especially directed towards the CPU-specific parts? I looked at spark-perf but that doesn't seem to have been updated recently.
Thanks, Prasun
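[For quick ad-hoc comparisons outside these suites, a minimal timing harness along the lines below can be enough. This is a sketch only - the parquet path, view name and query are placeholders, not part of either benchmark:]

import org.apache.spark.sql.SparkSession

object QueryTimer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("QueryTimer").getOrCreate()
    // Placeholder dataset: register whatever table the query under test needs.
    spark.read.parquet("/data/store_sales").createOrReplaceTempView("store_sales")
    // Time the same query over several iterations and report the average,
    // mirroring the "five iterations, average times compared" methodology.
    val times = (1 to 5).map { _ =>
      val start = System.nanoTime()
      spark.sql("SELECT ss_item_sk, count(*) FROM store_sales GROUP BY ss_item_sk").collect()
      (System.nanoTime() - start) / 1e9
    }
    println(f"average elapsed: ${times.sum / times.length}%.2fs over ${times.length} runs")
  }
}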
Re: [VOTE] Apache Spark 2.1.0 (RC5)
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest SDK for Java (8 SR3 FP21). Tests run clean on Ubuntu 16.04 and 14.04, SUSE 12 and CentOS 7.2, on x86 and IBM-specific platforms including big-endian. On slower machines I see the following failing, but nothing to be concerned over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by current_date, complete mode
org.apache.spark.sql.hive.HiveSparkSubmitSuite.set hive.metastore.warehouse.dir

Performance vs 2.0.2: lots of improvements seen using the HiBench and SparkSqlPerf benchmarks, tested on a 48-core Intel machine using the Kryo serializer in a controlled test environment. These are all open-source benchmarks anyone can use and experiment with. Elapsed times were measured; positive scores are an improvement (the workload is that much percent faster) and negative scores are regressions I'm seeing:

K-means: Java API +22% (100 sec to 78 sec), Scala API +30% (34 sec to 24 sec), Python API unchanged
PageRank: minor improvement, 40 sec to 38 sec, +5%
Sort: minor improvement, 10.8 sec to 9.8 sec, +10%
WordCount: unchanged
Bayes: mixed bag, sometimes much slower (95 sec to 140 sec, -47%), other times marginally faster by 15%; something to keep an eye on
Terasort: +18% (39 sec to 32 sec) with the Java/Scala APIs

For TPC-DS SQL queries the results are a mixed bag again: I see >10% boosts for q9, q68, q75, q96 and >10% slowdowns for q7, q39a, q43, q52, q57, q89. Five iterations, average times compared, only changing which version of Spark we're using.

From: Holden Karau
To: Denny Lee, Liwei Lin, "dev@spark.apache.org"
Date: 18/12/2016 20:05
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

+1 (non-binding) - checked Python artifacts with virtualenv.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee wrote:
+1 (non-binding)

On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin wrote:
+1
Cheers, Liwei

On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang wrote:
I hope https://github.com/apache/spark/pull/16252 can be fixed before release 2.1.0. It's a fix for broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley wrote:
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li wrote:
+1
Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung:
For R we have a license field in the DESCRIPTION, and this is standard practice (and a requirement) for R packages: https://cran.r-project.org/doc/manuals/R-exts.html#Licensing

From: Sean Owen
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin; dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

(If you have a template for these emails, maybe update it to use https links. They work for apache.org domains. After all, we are asking people to verify the integrity of release artifacts, so it might as well be secure.)

(Also the new archives use .tar.gz instead of .tgz like the others. No big deal, my OCD eye just noticed it.)

I don't see an Apache license / notice for the Pyspark or SparkR artifacts. It would be good practice to include this in a convenience binary. I'm not sure if it's strictly mandatory, but something to adjust in any event. I think that's all there is to do for SparkR. For Pyspark, which packages a bunch of dependencies, it does include the licenses (good) but I think it should include the NOTICE file.

This is the first time I recall getting 0 test failures off the bat! I'm using Java 8 / Ubuntu 16 and the yarn/hive/hadoop-2.7 profiles. I think I'd +1 this therefore, unless someone knows that the license issue above is real and a blocker.

On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin wrote:
Please vote on releasing the following candidate as Apache Spark version 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.0-rc5 (cd0a08361e2526519e7c131c42116bf56fa62c76)

List of JIRA tickets resolved: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apach
Re: [VOTE] Apache Spark 2.1.0 (RC2)
I've never seen the ReplSuite test OOMing with IBM's latest SDK for Java, but have always noticed this particular test failing with the following instead:

java.lang.AssertionError: assertion failed: deviation too large: 0.8506807397223823, first size: 180392, second size: 333848

This particular test could be improved and I don't think it should hold up releases. I commented on [SPARK-14558] a while back and the discussion ended with: "A better check would be to run with and without the closure cleaner change -> Yea, this is what I did locally, but how to write a test for it?" It will fail in this particular way reliably with OpenJDK/Oracle JDK as well if you were to use Kryo.

We don't see this test failing (either the OOM or the above failure) with OpenJDK 8 in our test farm; this is with OpenJDK 1.8.0_51-b16, running with -Xmx4g -Xss2048k -Dspark.buffer.pageSize=1048576. All other Spark unit tests pass (we see a grand total of 11980 tests) except for the Kafka stress test already mentioned, across various platforms and operating systems including big-endian. I've never seen the NoSuchMethod error mentioned in JavaUDFSuite, and haven't seen the failure Alan mentions below either.

I also have performance data to share (HiBench and SparkSqlPerf with TPC-DS queries) comparing this release to Spark 2.0.2. I'll wait until the next RC before commenting (it is positive); it looks like we'll have another, as this RC2 vote should be closed by now, and RC3 would also include the [SPARK-18091] fix to prevent a test's generated code exceeding the 64k constant pool size limit.

From: akchin
To: dev@spark.apache.org
Date: 13/12/2016 19:51
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC2)

Hello, I am seeing this error as well, except during "define case class and create Dataset together with paste mode *** FAILED ***". It starts throwing OOM and GC errors after running for several minutes.
- Alan Chin, IBM Spark Technology Center
Re: [VOTE] Release Apache Spark 2.0.2 (RC3)
+1 (non-binding)

Build: mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
Test: mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test
Test options: -Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

No problems with OpenJDK 8 on x86. No problems with the latest IBM Java 8 on various architectures including big and little-endian, and various operating systems including RHEL 7.2, CentOS 7.2, Ubuntu 14.04, Ubuntu 16.04 and SUSE 12. No issues with the Python tests. No performance concerns with HiBench large.

From: Mingjie Tang
To: Tathagata Das
Cc: Kousuke Saruta, Reynold Xin, dev
Date: 11/11/2016 03:44
Subject: Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

+1 (non-binding)

On Thu, Nov 10, 2016 at 6:06 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
+1 binding

On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta wrote:
+1 (non-binding)

On 2016年11月08日 15:09, Reynold Xin wrote:
Please vote on releasing the following candidate as Apache Spark version 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...

The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)

This release candidate resolves 84 issues: https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/

Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present in 2.0.1, missing features, or bugs related to new features will not necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
Re: [VOTE] Release Apache Spark 2.0.2 (RC2)
I'm seeing the same failure, but manifesting itself as a stack overflow, on various operating systems and architectures (RHEL 7.1, CentOS 7.2, SUSE 12, Ubuntu 14.04 and 16.04 LTS).

Build and test options:
mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test
-Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

Stacktrace (this is with IBM's latest SDK for Java 8):

scala> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError: operating system stack overflow
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:849)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:833)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:830)
at org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:137)
... omitted the rest for brevity

It would also be useful to include this small but useful change that looks to have only just missed the cut: https://github.com/apache/spark/pull/14409

From: Reynold Xin
To: Dongjoon Hyun
Cc: "dev@spark.apache.org"
Date: 02/11/2016 18:37
Subject: Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

Looks like there is an issue with Maven (likely just the test itself though). We should look into it.

On Wed, Nov 2, 2016 at 11:32 AM, Dongjoon Hyun wrote:
Hi, Sean. The same failure blocks me, too.

- SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED ***

I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dsparkr` on CentOS 7 / OpenJDK 1.8.0_111.
Dongjoon.

On 2016-11-02 10:44 (-0700), Sean Owen wrote:
> Sigs, license, etc. are OK. There are no Blockers for 2.0.2, though here are the 4 issues still open:
>
> SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
> SPARK-17957 Calling outer join and na.fill(0) and then inner join will miss rows
> SPARK-17981 Incorrectly Set Nullability to False in FilterExec
> SPARK-18160 spark.files & spark.jars should not be passed to driver in yarn mode
>
> Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on Ubuntu 16, I am seeing consistent failures in this test below. I think we very recently changed this so it could be legitimate. But does anyone else see something like this? I have seen other failures in this test due to OOM but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
>
> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED ***
> isContain was true Interpreter output contained 'Exception':
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
>       /_/
>
> Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> keyValueGrouped: org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] = org.apache.spark.sql.KeyValueGroupedDataset@70c30f72
>
> scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>
> scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(0)
>
> scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
>
> scala> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): com.google.common.util.concurrent.ExecutionError: java.lang.ClassCircularityError: io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
> at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
> at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
> at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
> at com.google.common.cache.LocalCache$LocalLoadingCache
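[Not part of the thread, but since both stack traces above originate in CodeGenerator.compile, one hedged triage step is to rerun the failing case with whole-stage code generation disabled and see whether the error moves or disappears. spark.sql.codegen.wholeStage is a real Spark 2.x conf; the session setup below is just a sketch:]

import org.apache.spark.sql.SparkSession

// Sketch: rerun a failing query with whole-stage code generation disabled,
// to check whether the StackOverflowError / ClassCircularityError comes
// from generated code rather than the query logic itself.
val spark = SparkSession.builder()
  .appName("codegen-triage")
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()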
Re: Spark 2.0.0 performance; potential large Spark core regression
Ted, that bug was https://issues.apache.org/jira/browse/SPARK-15822 and was only found as part of running an sql-flights application (not with the unit tests); I don't know if this has anything to do with the regression we're seeing.

One update is that we see the same ballpark regression for 1.6.2 vs 2.0 with HiBench (large profile, 25g executor memory, 4g driver). Again, we will be carefully checking how these benchmarks are being run and what difference the options and configurations can make.

Cheers,

From: Ted Yu
To: Adam Roberts/UK/IBM@IBMGB
Cc: Michael Allman, dev
Date: 08/07/2016 17:26
Subject: Re: Spark 2.0.0 performance; potential large Spark core regression

bq. we turned it off when fixing a bug
Adam: Can you refer to the bug JIRA? Thanks

On Fri, Jul 8, 2016 at 9:22 AM, Adam Roberts wrote:
Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the defaults in Spark 2 RC2 look to be:

sql.shuffle.partitions: 200
Tungsten enabled: true
Executor memory: 1 GB (we set to 18 GB)
kryo buffer max: 64mb
WholeStageCodegen: on I think, we turned it off when fixing a bug
offHeap.enabled: false
offHeap.size: 0

Cheers,

From: Michael Allman
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev
Date: 08/07/2016 17:05
Subject: Re: Spark 2.0.0 performance; potential large Spark core regression

Here are some settings we use for some very large GraphX jobs. These are based on using EC2 c3.8xl workers:

.set("spark.sql.shuffle.partitions", "1024")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.executor.memory", "24g")
.set("spark.kryoserializer.buffer.max", "1g")
.set("spark.sql.codegen.wholeStage", "true")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "25769803776") // 24 GB

Some of these are in fact default configurations. Some are not.

Michael

On Jul 8, 2016, at 9:01 AM, Michael Allman wrote:
Hi Adam, from our experience we've found the default Spark 2.0 configuration to be highly suboptimal. I don't know if this affects your benchmarks, but I would consider running some tests with tuned and alternate configurations.
Michael

On Jul 8, 2016, at 8:58 AM, Adam Roberts wrote:
Hi Michael, the two Spark configuration files aren't very exciting.

spark-env.sh: same as the template apart from a JAVA_HOME setting
spark-defaults.conf: spark.io.compression.codec lzf

config.py has the Spark home set and runs Spark in standalone mode; we run and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory fraction, 100 trials. We can post the 1.6.2 comparison early next week, running lots of iterations over the weekend once we get the dedicated time again.

Cheers,

From: Michael Allman
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev
Date: 08/07/2016 16:44
Subject: Re: Spark 2.0.0 performance; potential large Spark core regression

Hi Adam, do you have your spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available?
Cheers, Michael

On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote:
Hi, we've been testing the performance of Spark 2.0 compared to previous releases; unfortunately there are no Spark 2.0 compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108).

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor, and so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor. Has anybody noticed a similar problem?

My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated and welcome; I'd much prefer it if my changes are the problem. A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above).

1. spark-perf/config/config.py:
SCALE_FACTOR=0.05
No. of Workers: 1
Executors per Worker: 1
Executor Memory: 18G
Driver Memory: 8G
Serializer: kryo

2. $SPARK_HOME/conf/spark-defaults.conf:
executor Java Options: -Xdisableexplicitgc -Xcompressedrefs

Main changes I made for the benchmark itself:
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
MLAlgorithmTests use Vectors.fromML
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead of awaitTermination
Re: Spark performance regression test suite
Agreed. This is something that we do regularly when producing our own Spark distributions in IBM, so it will be beneficial to share updates with the wider community. So far it looks like Spark 1.6.2 is the best out of the box on spark-perf and HiBench (of course this may vary for real workloads, individual applications and tuning efforts), but we have more 2.0 tests to perform, and we're not aware of any regressions between previous versions except perhaps for the Spark 2.0.0 post I made. I'm looking for testing and feedback from any Spark gurus on my 2.0 changes for spark-perf (have a look at the open issue Holden mentioned: https://github.com/databricks/spark-perf/issues/108), and the same goes for HiBench (FWIW we see the same regression on HiBench too: https://github.com/intel-hadoop/HiBench/issues/221).

One idea for us is that the benchmarking could be run optionally as part of the existing contribution process. An ideal solution IMO would involve an additional parameter for the Jenkins job that, when ticked, results in a performance run being done with and without the change. As we don't have direct access to the Jenkins build button in the community, when contributing a change users could mark their change with something like @performance or "jenkins performance test this please". Alternatively, the influential Spark folk could notice a change with a potential performance impact and have it tested accordingly. While microbenchmarks are useful, it will be important to test the whole of Spark. Then there's also the use of tags in JIRA - lots for us to work with if we wanted this.

This probably means the addition, and therefore maintenance, of dedicated machines in the build farm, although it would highlight any regressions fast, as opposed to later in the development cycle. If there is indeed a regression, we may have the fun task of binary-chopping commits between 1.6.2 and now - again TBC but a real possibility - so I'm interested to see if anybody else is doing regression testing and whether they see a similar problem.

If we don't go down the "benchmark as you contribute" route, having such a suite would be perfect: it would clone the latest versions of each benchmark, build them for the current version of Spark (it can identify the release from the pom), run the benchmarks we care about (let's say in Spark standalone mode with a couple of executors) and produce a geomean score, highlighting any significant deviations (see the sketch after this thread). I'm happy to help with designing/reviewing this.

Cheers,

From: Michael Gummelt
To: Eric Liang
Cc: Holden Karau, Ted Yu, Michael Allman, dev
Date: 11/07/2016 17:00
Subject: Re: Spark performance regression test suite

I second any effort to update, automate, and communicate the results of spark-perf (https://github.com/databricks/spark-perf).

On Fri, Jul 8, 2016 at 12:28 PM, Eric Liang wrote:
Something like speed.pypy.org or the Chrome performance dashboards would be very useful.

On Fri, Jul 8, 2016 at 9:50 AM Holden Karau wrote:
There are also the spark-perf and spark-sql-perf projects in the Databricks github (although I see an open issue for Spark 2.0 support in one of them).

On Friday, July 8, 2016, Ted Yu wrote:
Found a few issues:
[SPARK-6810] Performance benchmarks for SparkR
[SPARK-2833] performance tests for linear regression
[SPARK-15447] Performance test for ALS in Spark 2.0
Haven't found one for Spark core.

On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman wrote:
Hello, I've seen a few messages on the mailing list regarding Spark performance concerns, especially regressions from previous versions. It got me thinking that perhaps an automated performance regression suite would be a worthwhile contribution. Is anyone working on this? Do we have a Jira issue for it? I cannot commit to taking charge of such a project; I just thought it would be a great contribution for someone who does have the time and the chops to build it.
Cheers, Michael

-- Michael Gummelt, Software Engineer, Mesosphere
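[As a sketch of the "geomean score, highlighting any significant deviations" idea mentioned above - the score maps and the 10% threshold are assumptions, not anything the suite defines:]

// Sketch: compare two runs of named benchmark elapsed times, report the
// geometric-mean ratio and flag individual deviations beyond 10%.
object RegressionCheck {
  def geomean(xs: Seq[Double]): Double =
    math.exp(xs.map(math.log).sum / xs.length)

  def compare(baseline: Map[String, Double], candidate: Map[String, Double]): Unit = {
    val ratios = baseline.keys.toSeq.sorted.map { name =>
      val r = candidate(name) / baseline(name)
      if (math.abs(r - 1.0) > 0.10)
        println(f"$name%-25s ${(r - 1.0) * 100}%+.1f%% elapsed-time change")
      r
    }
    println(f"geomean ratio: ${geomean(ratios)}%.3f (>1.0 means the candidate is slower)")
  }
}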
Re: Spark 2.0.0 performance; potential large Spark core regression
Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the defaults in Spark 2 RC2 look to be:

sql.shuffle.partitions: 200
Tungsten enabled: true
Executor memory: 1 GB (we set to 18 GB)
kryo buffer max: 64mb
WholeStageCodegen: on I think, we turned it off when fixing a bug
offHeap.enabled: false
offHeap.size: 0

Cheers,

From: Michael Allman
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev
Date: 08/07/2016 17:05
Subject: Re: Spark 2.0.0 performance; potential large Spark core regression

Here are some settings we use for some very large GraphX jobs. These are based on using EC2 c3.8xl workers:

.set("spark.sql.shuffle.partitions", "1024")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.executor.memory", "24g")
.set("spark.kryoserializer.buffer.max", "1g")
.set("spark.sql.codegen.wholeStage", "true")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "25769803776") // 24 GB

Some of these are in fact default configurations. Some are not.

Michael

On Jul 8, 2016, at 9:01 AM, Michael Allman wrote:
Hi Adam, from our experience we've found the default Spark 2.0 configuration to be highly suboptimal. I don't know if this affects your benchmarks, but I would consider running some tests with tuned and alternate configurations.
Michael

On Jul 8, 2016, at 8:58 AM, Adam Roberts wrote:
Hi Michael, the two Spark configuration files aren't very exciting.

spark-env.sh: same as the template apart from a JAVA_HOME setting
spark-defaults.conf: spark.io.compression.codec lzf

config.py has the Spark home set and runs Spark in standalone mode; we run and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory fraction, 100 trials. We can post the 1.6.2 comparison early next week, running lots of iterations over the weekend once we get the dedicated time again.

Cheers,

From: Michael Allman
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev
Date: 08/07/2016 16:44
Subject: Re: Spark 2.0.0 performance; potential large Spark core regression

Hi Adam, do you have your spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available?
Cheers, Michael

On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote:
Hi, we've been testing the performance of Spark 2.0 compared to previous releases; unfortunately there are no Spark 2.0 compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108).

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor, and so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor. Has anybody noticed a similar problem?

My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated and welcome; I'd much prefer it if my changes are the problem. A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above).

1. spark-perf/config/config.py:
SCALE_FACTOR=0.05
No. of Workers: 1
Executors per Worker: 1
Executor Memory: 18G
Driver Memory: 8G
Serializer: kryo

2. $SPARK_HOME/conf/spark-defaults.conf:
executor Java Options: -Xdisableexplicitgc -Xcompressedrefs

Main changes I made for the benchmark itself:
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
MLAlgorithmTests use Vectors.fromML
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead of awaitTermination
Trivial: we use compact not compact.render for outputting json

In Spark 2.0 the top five methods where we spend our time, with the percentage being how much of the overall processing time was spent in that particular method:
1. AppendOnlyMap.changeValue 44%
2. SortShuffleWriter.write 19%
3. SizeTracker.estimateSize 7.5%
4. SizeEstimator.estimate 5.36%
5. Range.foreach 3.6%

and in 1.5.2 the top five methods are:
1. AppendOnlyMap.changeValue 38%
2. ExternalSorter.insertAll 33%
3. Range.foreach 4%
4. SizeEstimator.estimate 2%
5. SizeEstimator.visitSingleObject 2%

I see the following scores; on the left is the test name, followed by the 1.5.2 time and then the 2.0.0 time:
scheduling throughput: 5.2s vs 7.08s
agg by key: 0.72s vs 1.01s
agg by key int: 0.93s vs 1.19s
agg by key naive: 1.88s vs 2.02s
sort by key: 0.64s vs 0.8s
sort by key int: 0.59s vs 0.64s
scala count: 0.09s vs 0.08s
scala count w fltr: 0.31s vs 0.47s
Re: Spark 2.0.0 performance; potential large Spark core regression
Hi Michael, the two Spark configuration files aren't very exciting.

spark-env.sh: same as the template apart from a JAVA_HOME setting
spark-defaults.conf: spark.io.compression.codec lzf

config.py has the Spark home set and runs Spark in standalone mode; we run and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 memory fraction, 100 trials. We can post the 1.6.2 comparison early next week, running lots of iterations over the weekend once we get the dedicated time again.

Cheers,

From: Michael Allman
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev
Date: 08/07/2016 16:44
Subject: Re: Spark 2.0.0 performance; potential large Spark core regression

Hi Adam, do you have your spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available?
Cheers, Michael

On Jul 8, 2016, at 3:17 AM, Adam Roberts wrote:
Hi, we've been testing the performance of Spark 2.0 compared to previous releases; unfortunately there are no Spark 2.0 compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108).

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor, and so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor. Has anybody noticed a similar problem?

My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated and welcome; I'd much prefer it if my changes are the problem. A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above).

1. spark-perf/config/config.py:
SCALE_FACTOR=0.05
No. of Workers: 1
Executors per Worker: 1
Executor Memory: 18G
Driver Memory: 8G
Serializer: kryo

2. $SPARK_HOME/conf/spark-defaults.conf:
executor Java Options: -Xdisableexplicitgc -Xcompressedrefs

Main changes I made for the benchmark itself:
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
MLAlgorithmTests use Vectors.fromML
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead of awaitTermination
Trivial: we use compact not compact.render for outputting json

In Spark 2.0 the top five methods where we spend our time, with the percentage being how much of the overall processing time was spent in that particular method:
1. AppendOnlyMap.changeValue 44%
2. SortShuffleWriter.write 19%
3. SizeTracker.estimateSize 7.5%
4. SizeEstimator.estimate 5.36%
5. Range.foreach 3.6%

and in 1.5.2 the top five methods are:
1. AppendOnlyMap.changeValue 38%
2. ExternalSorter.insertAll 33%
3. Range.foreach 4%
4. SizeEstimator.estimate 2%
5. SizeEstimator.visitSingleObject 2%

I see the following scores; on the left is the test name, followed by the 1.5.2 time and then the 2.0.0 time:
scheduling throughput: 5.2s vs 7.08s
agg by key: 0.72s vs 1.01s
agg by key int: 0.93s vs 1.19s
agg by key naive: 1.88s vs 2.02s
sort by key: 0.64s vs 0.8s
sort by key int: 0.59s vs 0.64s
scala count: 0.09s vs 0.08s
scala count w fltr: 0.31s vs 0.47s

This is only running the Spark core tests (scheduling throughput through scala-count-w-filtr, including all in between).

Cheers,
Spark 2.0.0 performance; potential large Spark core regression
Hi, we've been testing the performance of Spark 2.0 compared to previous releases; unfortunately there are no Spark 2.0 compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108).

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor, and so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor. Has anybody noticed a similar problem?

My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated and welcome; I'd much prefer it if my changes are the problem. A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above).

1. spark-perf/config/config.py:
SCALE_FACTOR=0.05
No. of Workers: 1
Executors per Worker: 1
Executor Memory: 18G
Driver Memory: 8G
Serializer: kryo

2. $SPARK_HOME/conf/spark-defaults.conf:
executor Java Options: -Xdisableexplicitgc -Xcompressedrefs

Main changes I made for the benchmark itself:
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
MLAlgorithmTests use Vectors.fromML
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext instead of awaitTermination
Trivial: we use compact not compact.render for outputting json

In Spark 2.0 the top five methods where we spend our time, with the percentage being how much of the overall processing time was spent in that particular method:
1. AppendOnlyMap.changeValue 44%
2. SortShuffleWriter.write 19%
3. SizeTracker.estimateSize 7.5%
4. SizeEstimator.estimate 5.36%
5. Range.foreach 3.6%

and in 1.5.2 the top five methods are:
1. AppendOnlyMap.changeValue 38%
2. ExternalSorter.insertAll 33%
3. Range.foreach 4%
4. SizeEstimator.estimate 2%
5. SizeEstimator.visitSingleObject 2%

I see the following scores; on the left is the test name, followed by the 1.5.2 time and then the 2.0.0 time:
scheduling throughput: 5.2s vs 7.08s
agg by key: 0.72s vs 1.01s
agg by key int: 0.93s vs 1.19s
agg by key naive: 1.88s vs 2.02s
sort by key: 0.64s vs 0.8s
sort by key int: 0.59s vs 0.64s
scala count: 0.09s vs 0.08s
scala count w fltr: 0.31s vs 0.47s

This is only running the Spark core tests (scheduling throughput through scala-count-w-filtr, including all in between).

Cheers,
Re: Understanding pyspark data flow on worker nodes
Hi, sharing what I discovered with PySpark too; it corroborates what Amit notices, and I'm also interested in the pipe question: https://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3c201603291521.u2tflbfo024...@d06av05.portsmouth.uk.ibm.com%3E

// Start a thread to feed the process input from our parent's iterator
val writerThread = new WriterThread(env, worker, inputIterator, partitionIndex, context)
...
// Return an iterator that read lines from the process's stdout
val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))

The above code and what follows look to be the important parts. Note that Josh Rosen replied to my comment with more information: "One clarification: there are Python interpreters running on executors so that Python UDFs and RDD API code can be executed. Some slightly-outdated but mostly-correct reference material for this can be found at https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. See also: search the Spark codebase for PythonRDD and look at python/pyspark/worker.py"

From: Reynold Xin
To: Amit Rana
Cc: "dev@spark.apache.org"
Date: 08/07/2016 07:03
Subject: Re: Understanding pyspark data flow on worker nodes

You can look into its source code: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala

On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana wrote:
Hi all, did anyone get a chance to look into it? Any sort of guidance will be much appreciated.
Thanks, Amit Rana

On 7 Jul 2016 14:28, "Amit Rana" wrote:
As mentioned in the documentation: PythonRDD objects launch Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed. I am trying to understand the implementation of how this data transfer happens using pipes. Can anyone please guide me along that line?
Thanks, Amit Rana

On 7 Jul 2016 13:44, "Sun Rui" wrote:
You can read https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. For PySpark data flow on worker nodes, you can read the source code of PythonRDD.scala. Python worker processes communicate with Spark executors via sockets instead of pipes.

On Jul 7, 2016, at 15:49, Amit Rana wrote:
Hi all, I am trying to trace the data flow in PySpark, using IntelliJ IDEA on Windows 7. I had submitted a Python job with: --master local[4]. I have made the following observations after running the above command in debug mode:
-> Locally, when a PySpark interpreter starts, it also starts a JVM with which it communicates through a socket.
-> py4j is used to handle this communication.
-> This JVM acts as the actual Spark driver, and loads a JavaSparkContext which communicates with the Spark executors in the cluster.
I have read that, in the cluster, data flow between the Spark executors and the Python interpreter happens using pipes, but I am not able to trace that data flow. Please correct me if my understanding is wrong. It would be very helpful if someone can help me understand the code flow for data transfer between the JVM and the Python workers.
Thanks, Amit Rana
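[The quoted PythonRDD fragment boils down to the pattern below: a writer thread feeds a child process's stdin while the parent reads its stdout. This standalone sketch uses a plain `cat` child purely to illustrate the plumbing - it is not Spark's actual worker protocol, which also frames data lengths and reuses workers:]

import java.io.{BufferedReader, InputStreamReader, PrintWriter}

// Sketch of the pipe pattern used by PythonRDD: one thread writes the
// partition's input to the child's stdin, the parent reads results back
// from the child's stdout. `cat` stands in for the Python worker here.
object PipeSketch {
  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder("cat").start()
    val writer = new Thread {
      override def run(): Unit = {
        val out = new PrintWriter(proc.getOutputStream)
        Iterator("a", "b", "c").foreach(out.println) // stand-in for the partition iterator
        out.close() // closing stdin lets the child terminate
      }
    }
    writer.start()
    val in = new BufferedReader(new InputStreamReader(proc.getInputStream))
    Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(println)
    writer.join()
  }
}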
Re: Databricks SparkPerf with Spark 2.0
Fixed the below problem: grepped for spark.version, noticed some instances of 1.5.2 being declared, and changed them to 2.0.0-preview in spark-tests/project/SparkTestsBuild.scala.

The next one to fix is:

16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;

I'm going to log this and further progress under "Issues" for the project itself (probably need to change the org.json4s version in SparkTestsBuild.scala - now I know this file is super important), so the emails here will at least point people there.

Cheers,

From: Adam Roberts/UK/IBM@IBMGB
To: dev
Date: 14/06/2016 12:18
Subject: Databricks SparkPerf with Spark 2.0

Hi, I'm working on having "SparkPerf" (https://github.com/databricks/spark-perf) run with Spark 2.0. I noticed a few pull requests not yet accepted, so I'm concerned this project's been abandoned - it's proven very useful in the past for quality assurance, as we can easily exercise lots of Spark functions with a cluster (perhaps exposing problems that don't surface with the Spark unit tests).

I want to use Scala 2.11.8 and Spark 2.0.0, so I'm making my way through various files, currently faced with a NoSuchMethodError:

NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137)

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    rdd.asInstanceOf[RDD[(String, String)]]
      .map { case (k, v) => (k, v.toInt) }.reduceByKey(_ + _, reduceTasks).count()
  }
}

Grepping shows:
./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala -> rddToPairRDDFunctions

The scheduling-throughput tests complete fine, but the problem here is seen with agg-by-key (and likely other modules to fix, owing to API changes between 1.x and 2.x, which I guess is the cause of the above problem).

Has anybody already made good progress here? I'd like to work together and get this available for everyone; I'll be churning through it either way. Will be looking at HiBench also. The next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and work from there, although I figured the prep-tests stage would do this for me (how else is it going to build?).

Cheers,
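[On the json4s NoSuchMethodError above: a plausible fix, assuming the mismatch is the json4s version, is to pin the dependency in SparkTestsBuild.scala to the release Spark 2.0.0 itself ships - 3.2.11 at the time of writing, but worth verifying against Spark's own pom. A hedged sbt sketch:]

// In spark-tests/project/SparkTestsBuild.scala (sbt build definition) - sketch:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0-preview" % "provided",
  // Align json4s with the version Spark 2.0 itself depends on, so that
  // JsonMethods.render$default$2 resolves at runtime.
  "org.json4s" %% "json4s-jackson" % "3.2.11"
)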
Databricks SparkPerf with Spark 2.0
Hi, I'm working on having "SparkPerf" (https://github.com/databricks/spark-perf) run with Spark 2.0. I noticed a few pull requests not yet accepted, so I'm concerned this project's been abandoned - it's proven very useful in the past for quality assurance, as we can easily exercise lots of Spark functions with a cluster (perhaps exposing problems that don't surface with the Spark unit tests).

I want to use Scala 2.11.8 and Spark 2.0.0, so I'm making my way through various files, currently faced with a NoSuchMethodError:

NoSuchMethodError: org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137)

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
    rdd.asInstanceOf[RDD[(String, String)]]
      .map { case (k, v) => (k, v.toInt) }.reduceByKey(_ + _, reduceTasks).count()
  }
}

Grepping shows:
./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala -> rddToPairRDDFunctions

The scheduling-throughput tests complete fine, but the problem here is seen with agg-by-key (and likely other modules to fix, owing to API changes between 1.x and 2.x, which I guess is the cause of the above problem).

Has anybody already made good progress here? I'd like to work together and get this available for everyone; I'll be churning through it either way. Will be looking at HiBench also. The next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and work from there, although I figured the prep-tests stage would do this for me (how else is it going to build?).

Cheers,
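[For what it's worth, the implicit conversion behind reduceByKey moved from SparkContext to the RDD companion object in Spark 1.3, and the old deprecated SparkContext.rddToPairRDDFunctions was removed in 2.0 - so a jar compiled against 1.x would throw exactly this NoSuchMethodError on a 2.0 runtime, and recompiling against 2.0 should resolve it. Assuming that removal is indeed the cause here, the equivalent code compiles against 2.x with no explicit import:]

import org.apache.spark.rdd.RDD

// Compiled against Spark 2.x, reduceByKey resolves through the implicit
// PairRDDFunctions conversion on object RDD - no SparkContext._ import needed.
def aggregateByKey(rdd: RDD[(String, String)], reduceTasks: Int): Long =
  rdd.map { case (k, v) => (k, v.toInt) }
     .reduceByKey(_ + _, reduceTasks)
     .count()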
Caching behaviour and deserialized size
Hi, given a very simple test that uses a bigger version of the pom.xml file in our Spark home directory (cat with a bash for loop into itself so it becomes 100 MB), I've noticed that with larger heap sizes more RDD blocks are reported as being cached. Is this intended behaviour? What exactly are we looking at - replicas perhaps (the resiliency in RDD), or partitions of the same RDD?

With a 512 MB heap (max and initial size), regardless of JDK vendor:

Looking for mybiggerpom.xml in the directory you're running this application from
Added broadcast_0_piece0 in memory on 10.0.2.15:35762 (size: 15.8 KB, free: 159.0 MB)
caching in memory
Added broadcast_1_piece0 in memory on 10.0.2.15:35762 (size: 1789.0 B, free: 159.0 MB)
Added rdd_1_0 in memory on 10.0.2.15:35762 (size: 110.7 MB, free: 48.3 MB)
lines.count(): 2790700

Yet if I increase it to 1024 MB (again max and initial size), I see this:

Looking for mybiggerpom.xml in the directory you're running this application from
Added broadcast_0_piece0 in memory on 10.0.2.15:39739 (size: 15.8 KB, free: 543.0 MB)
caching in memory
Added broadcast_1_piece0 in memory on 10.0.2.15:39739 (size: 1789.0 B, free: 543.0 MB)
Added rdd_1_0 in memory on 10.0.2.15:39739 (size: 110.7 MB, free: 432.3 MB)
Added rdd_1_1 in memory on 10.0.2.15:39739 (size: 107.3 MB, free: 325.0 MB)
Added rdd_1_2 in memory on 10.0.2.15:39739 (size: 107.0 MB, free: 218.1 MB)
lines.count(): 2790700

My simple test case:

//scalastyle:off
import java.io.File

import org.apache.spark._
import org.apache.spark.rdd._

object Trimmed {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Adam RDD cached size experiment")
      .setMaster("local[1]"))
    var fileName = "mybiggerpom.xml"
    if (args != null && args.length > 0) {
      fileName = args(0)
    }
    println("Looking for " + fileName + " in the directory you're running this application from")
    val lines = sc.textFile(fileName)
    println("caching in memory")
    lines.cache()
    println("lines.count(): " + lines.count())
  }
}

I also want to figure out where the cached RDD size value comes from, and I noticed deserializedSize is used (in BlockManagerMasterEndpoint.scala) - where does this value come from? I understand SizeEstimator plays a big role, but it's unclear who's responsible for figuring out deserializedSize in the first place, despite my best efforts with IntelliJ and a lot of grepping. I'm using recent Spark 2.0 code; any guidance here will be appreciated, cheers
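[On the first question: block names follow the pattern rdd_<rddId>_<partitionIndex>, so rdd_1_0, rdd_1_1 and rdd_1_2 in the output above are three partitions of the same cached RDD - with the 512 MB heap only the first ~110 MB partition fits in the storage region, while with 1024 MB all three do. On the size question, SizeEstimator can be invoked directly to see the kind of number the block manager reports; a quick sketch:]

import org.apache.spark.util.SizeEstimator

// Estimate the deserialized in-memory footprint of an object graph - the
// same machinery the storage layer leans on when sizing cached blocks.
val sample: Array[String] = Array.fill(1000000)("some pom.xml line")
println(s"estimated size: ${SizeEstimator.estimate(sample)} bytes")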
Re: BytesToBytes and unaligned memory
Ted, yes, with the forced true value all tests pass; we use the unaligned check in 15 other suites. Our java.nio.Bits.unaligned() implementation checks that the detected os.arch value matches a list of known implementations (not including s390x). We could add it to the known architectures in the catch block, but this won't make a difference here: because we call unaligned() OK (no exception is thrown), we never reach the architecture-checking fallback anyway.

I see in org.apache.spark.memory.MemoryManager that unaligned support is required for off-heap memory in Tungsten (perhaps incorrectly, if no code ever exercises it in Spark?). Instead of a hard requirement, should we instead log a warning once that this is likely to lead to slow performance? What's the rationale for supporting unaligned memory access? It's my understanding that it's typically very slow; are there any design docs or perhaps a JIRA where I can learn more?

I will run a simple test case exercising unaligned memory access for Linux on Z (without using Spark), and can also run the tests claiming to require unaligned memory access on a platform where unaligned access is definitely not supported for shorts/ints/longs. If those tests continue to pass, then I think the Spark tests don't exercise unaligned memory access. Cheers

From: Ted Yu
To: Adam Roberts/UK/IBM@IBMGB
Cc: "dev@spark.apache.org"
Date: 15/04/2016 17:35
Subject: Re: BytesToBytes and unaligned memory

I am curious whether all Spark unit tests pass with the forced true value for unaligned. If that is the case, it seems we can add s390x to the known architectures. It would also give us some more background if you can describe how java.nio.Bits#unaligned() is implemented on s390x. Josh / Andrew / Davies / Ryan are more familiar with the related code; it would be good to hear what they think. Thanks

On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts wrote:
Ted, yeah, with the forced true value the tests in that suite all pass, and I know they're being executed thanks to prints I've added.
Cheers,

From: Ted Yu
To: Adam Roberts/UK/IBM@IBMGB
Cc: "dev@spark.apache.org"
Date: 15/04/2016 16:43
Subject: Re: BytesToBytes and unaligned memory

Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with the forced true value for unaligned? If the test failed, please pastebin the failure(s). Thanks

On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts wrote:
Ted, yep, I'm working from the latest code, which includes that unaligned check. For experimenting I've modified that code to ignore the unaligned check (just go ahead and say we support it anyway, even though our JDK returns false: the return value of java.nio.Bits.unaligned()). My Platform.java for testing contains:

private static final boolean unaligned;
static {
  boolean _unaligned;
  // use reflection to access the unaligned field
  try {
    System.out.println("Checking unaligned support");
    Class bitsClass = Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader());
    Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
    unalignedMethod.setAccessible(true);
    _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
    System.out.println("Used reflection and _unaligned is: " + _unaligned);
    System.out.println("Setting to true anyway for experimenting");
    _unaligned = true;
  } catch (Throwable t) {
    // We at least know x86 and x64 support unaligned access.
    String arch = System.getProperty("os.arch", "");
    //noinspection DynamicRegexReplaceableByCompiledPattern
    // We don't actually get here since we find the unaligned method OK and it returns false
    // (I override with true anyway), but add s390x in case we somehow fail anyway.
    System.out.println("Checking for s390x, os.arch is: " + arch);
    _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");
  }
  unaligned = _unaligned;
  System.out.println("returning: " + unaligned);
}

Output is, as you'd expect, "used reflection and _unaligned is false, setting to true anyway for experimenting", and the tests pass. No other problems on the platform (pending a different pull request).
Cheers,

From: Ted Yu
To: Adam Roberts/UK/IBM@IBMGB
Cc: "dev@spark.apache.org"
Date: 15/04/2016 15:32
Subject: Re: BytesToBytes and unaligned memory

I assume you tested 2.0 with SPARK-12181. Related code from Platform.java if java.nio.Bits#unaligned() throws an exception:

// We at least know x86 and x64 support unaligned access.
String arch = System.getProperty("os.arch",
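[For reference, the standalone check mentioned above could look something like this: writing and reading a long at a deliberately odd offset via sun.misc.Unsafe, which is essentially what Spark's Platform wrapper does. This is a sketch, not a definitive probe - if it runs without a crash on Linux on Z, the platform tolerates unaligned access regardless of what java.nio.Bits.unaligned() reports:]

import java.lang.reflect.Field

// Standalone unaligned-access probe: put/get a long at an odd address.
object UnalignedProbe {
  def main(args: Array[String]): Unit = {
    val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    val unsafe = f.get(null).asInstanceOf[sun.misc.Unsafe]

    val base = unsafe.allocateMemory(32)
    val oddAddress = base + 1 // deliberately misaligned for an 8-byte long
    unsafe.putLong(oddAddress, 0xCAFEBABEL)
    assert(unsafe.getLong(oddAddress) == 0xCAFEBABEL)
    unsafe.freeMemory(base)
    println("unaligned long put/get succeeded")
  }
}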
Re: BytesToBytes and unaligned memory
Ted, yeah with the forced true value the tests in that suite all pass and I know they're being executed thanks to prints I've added Cheers, From: Ted Yu To: Adam Roberts/UK/IBM@IBMGB Cc: "dev@spark.apache.org" Date: 15/04/2016 16:43 Subject:Re: BytesToBytes and unaligned memory Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with the forced true value for unaligned ? If the test failed, please pastebin the failure(s). Thanks On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts wrote: Ted, yep I'm working from the latest code which includes that unaligned check, for experimenting I've modified that code to ignore the unaligned check (just go ahead and say we support it anyway, even though our JDK returns false: the return value of java.nio.Bits.unaligned()). My Platform.java for testing contains: private static final boolean unaligned; static { boolean _unaligned; // use reflection to access unaligned field try { System.out.println("Checking unaligned support"); Class bitsClass = Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader()); Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned"); unalignedMethod.setAccessible(true); _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null)); System.out.println("Used reflection and _unaligned is: " + _unaligned); System.out.println("Setting to true anyway for experimenting"); _unaligned = true; } catch (Throwable t) { // We at least know x86 and x64 support unaligned access. String arch = System.getProperty("os.arch", ""); //noinspection DynamicRegexReplaceableByCompiledPattern // We don't actually get here since we find the unaligned method OK and it returns false (I override with true anyway) // but add s390x incase we somehow fail anyway. System.out.println("Checking for s390x, os.arch is: " + arch); _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$"); } unaligned = _unaligned; System.out.println("returning: " + unaligned); } } Output is, as you'd expect, "used reflection and _unaligned is false, setting to true anyway for experimenting", and the tests pass. No other problems on the platform (pending a different pull request). Cheers, From:Ted Yu To:Adam Roberts/UK/IBM@IBMGB Cc:"dev@spark.apache.org" Date:15/04/2016 15:32 Subject:Re: BytesToBytes and unaligned memory I assume you tested 2.0 with SPARK-12181 . Related code from Platform.java if java.nio.Bits#unaligned() throws exception: // We at least know x86 and x64 support unaligned access. String arch = System.getProperty("os.arch", ""); //noinspection DynamicRegexReplaceableByCompiledPattern _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$"); Can you give us some detail on how the code runs for JDKs on zSystems ? Thanks On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts wrote: Hi, I'm testing Spark 2.0.0 on various architectures and have a question, are we sure if core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java really is attempting to use unaligned memory access (for the BytesToBytesMapOffHeapSuite tests specifically)? Our JDKs on zSystems for example return false for the java.nio.Bits.unaligned() method and yet if I skip this check and add s390x to the supported architectures (for zSystems), all thirteen tests here pass. 
The 13 tests here all fail as we do not pass the unaligned requirement (but perhaps incorrectly): core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java, and I know the unaligned checking is at common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java.

Either our JDK's method is returning false incorrectly, or this test isn't using unaligned memory access (so the requirement is invalid); there's no mention of alignment in the test itself. Any guidance would be very much appreciated, cheers
Re: BytesToBytes and unaligned memory
Ted, yep I'm working from the latest code, which includes that unaligned check. For experimenting I've modified that code to ignore the unaligned check (just go ahead and say we support it anyway, even though our JDK returns false as the return value of java.nio.Bits.unaligned()). My Platform.java for testing contains:

  private static final boolean unaligned;
  static {
    boolean _unaligned;
    // use reflection to access unaligned field
    try {
      System.out.println("Checking unaligned support");
      Class<?> bitsClass = Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader());
      Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
      unalignedMethod.setAccessible(true);
      _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
      System.out.println("Used reflection and _unaligned is: " + _unaligned);
      System.out.println("Setting to true anyway for experimenting");
      _unaligned = true;
    } catch (Throwable t) {
      // We at least know x86 and x64 support unaligned access.
      String arch = System.getProperty("os.arch", "");
      //noinspection DynamicRegexReplaceableByCompiledPattern
      // We don't actually get here since we find the unaligned method OK and it returns false (I override with true anyway),
      // but add s390x in case we somehow fail anyway.
      System.out.println("Checking for s390x, os.arch is: " + arch);
      _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$");
    }
    unaligned = _unaligned;
    System.out.println("returning: " + unaligned);
  }
}

Output is, as you'd expect, "used reflection and _unaligned is false, setting to true anyway for experimenting", and the tests pass. No other problems on the platform (pending a different pull request).

Cheers,

From: Ted Yu
To: Adam Roberts/UK/IBM@IBMGB
Cc: "dev@spark.apache.org"
Date: 15/04/2016 15:32
Subject: Re: BytesToBytes and unaligned memory

I assume you tested 2.0 with SPARK-12181. Related code from Platform.java if java.nio.Bits#unaligned() throws an exception:

  // We at least know x86 and x64 support unaligned access.
  String arch = System.getProperty("os.arch", "");
  //noinspection DynamicRegexReplaceableByCompiledPattern
  _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|amd64)$");

Can you give us some detail on how the code runs for JDKs on zSystems? Thanks

On Fri, Apr 15, 2016 at 7:01 AM, Adam Roberts wrote:

Hi, I'm testing Spark 2.0.0 on various architectures and have a question: are we sure that core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java really is attempting to use unaligned memory access (for the BytesToBytesMapOffHeapSuite tests specifically)? Our JDKs on zSystems, for example, return false for the java.nio.Bits.unaligned() method, and yet if I skip this check and add s390x to the supported architectures (for zSystems), all thirteen tests here pass.

The 13 tests here all fail as we do not pass the unaligned requirement (but perhaps incorrectly): core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java, and I know the unaligned checking is at common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java.

Either our JDK's method is returning false incorrectly, or this test isn't using unaligned memory access (so the requirement is invalid); there's no mention of alignment in the test itself. Any guidance would be very much appreciated, cheers
BytesToBytes and unaligned memory
Hi, I'm testing Spark 2.0.0 on various architectures and have a question: are we sure that core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java really is attempting to use unaligned memory access (for the BytesToBytesMapOffHeapSuite tests specifically)? Our JDKs on zSystems, for example, return false for the java.nio.Bits.unaligned() method, and yet if I skip this check and add s390x to the supported architectures (for zSystems), all thirteen tests here pass.

The 13 tests here all fail as we do not pass the unaligned requirement (but perhaps incorrectly): core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java, and I know the unaligned checking is at common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java.

Either our JDK's method is returning false incorrectly, or this test isn't using unaligned memory access (so the requirement is invalid); there's no mention of alignment in the test itself. Any guidance would be very much appreciated, cheers
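To make "unaligned memory access" concrete outside of Spark, a minimal standalone probe is to write and read a long at a deliberately odd off-heap address. This is a sketch for illustration only, using sun.misc.Unsafe in the same spirit as Platform.java; it is not part of the Spark code base:

  import java.lang.reflect.Field;
  import sun.misc.Unsafe;

  // Probe: can this JVM/hardware combination read and write an 8-byte value
  // at an address that is not 8-byte aligned?
  public class UnalignedProbe {
      public static void main(String[] args) throws Exception {
          // Grab the Unsafe singleton via reflection (its constructor is private)
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          Unsafe unsafe = (Unsafe) f.get(null);

          long base = unsafe.allocateMemory(16);
          try {
              unsafe.putLong(base + 1, 0x1122334455667788L); // base + 1 is unaligned
              System.out.println(Long.toHexString(unsafe.getLong(base + 1)));
          } finally {
              unsafe.freeMemory(base);
          }
      }
  }

If this prints 1122334455667788 without crashing, unaligned access works on the platform, which would be consistent with the observation above that the off-heap suite passes on s390x once the architecture check is bypassed.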
Understanding PySpark Internals
Hi, I'm interested in figuring out how the Python API for Spark works. I've come to the following conclusions and want to share them with the community; they could be of use in the PySpark docs here, specifically the "Execution and pipelining" part. Any sanity checking would be much appreciated. Here's the trivial Python example I've traced:

  from pyspark import SparkContext
  sc = SparkContext("local[1]", "Adam test")
  sc.setCheckpointDir("foo checkpoint dir")

Added this JVM option:

  export IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"

Prints added in py4j-java/src/py4j/commands/CallCommand.java, specifically in the execute method; built and replaced the existing class in the py4j 0.9 jar in my Spark assembly jar. Example output is:

  In execute for CallCommand, commandName: c target object id: o0 methodName: get

I'll launch the Spark application with:

  $SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1

I've quickly put together the following WIP diagram of what I think is happening: http://postimg.org/image/nihylmset/

To summarise, I think:
- We're heavily using reflection (as evidenced by Py4j's ReflectionEngine and MethodInvoker classes) to invoke Spark's API in a JVM from Python
- There's an agreed protocol (in Py4j's Protocol.java) for handling commands: said commands are exchanged using a local socket between Python and our JVM (the driver based on the docs, not the master)
- The Spark API is accessible by means of commands exchanged over said socket using the agreed protocol
- Commands are read/written using BufferedReader/Writer
- Type conversion is also performed from Python to Java (not looked at in detail yet)
- We keep track of the objects, with, for example, o0 representing the first object we know about

Does this sound correct? I've only checked the trace output in local mode; I'm curious as to what happens when we're running in standalone mode (I didn't see a Python interpreter appearing on all workers in order to process partitions of data, so I assume in standalone mode we use Python solely as an orchestrator - the driver - and not as an executor for distributed computing?). Happy to provide the full trace output on request (omitted timestamps and logging info, added spacing); I expect there's an O*JDK method tracing equivalent so the above can easily be reproduced regardless of Java vendor. Cheers,
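That summary matches Py4j's documented design. As a sketch of the entry-point-over-a-socket mechanics (a minimal standalone Py4j example, not Spark's actual wiring; the AdditionApplication class and its method are made up for illustration):

  import py4j.GatewayServer;

  // Minimal Py4j gateway: Python connects over a local socket (port 25333 by
  // default) and invokes methods on the entry point via protocol commands,
  // which Py4j dispatches using reflection.
  public class AdditionApplication {

      public int addition(int a, int b) { // callable from Python
          return a + b;
      }

      public static void main(String[] args) {
          GatewayServer server = new GatewayServer(new AdditionApplication());
          server.start();
          System.out.println("Gateway server started");
      }
  }

From Python, py4j.java_gateway.JavaGateway().entry_point.addition(1, 2) would then generate exactly the kind of CallCommand traffic traced above: a "c" command name, a target object id such as o0, and a method name.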
Tungsten in a mixed endian environment
Hi all, I've been experimenting with DataFrame operations in a mixed endian environment - a big endian master with little endian workers. With tungsten enabled I'm encountering data corruption issues. For example, with this simple test code:

  import org.apache.spark._
  import org.apache.spark.sql.SQLContext

  object SimpleSQL {
    def main(args: Array[String]): Unit = {
      if (args.length != 1) {
        println("Not enough args, you need to specify the master url")
        return
      }
      val masterURL = args(0)
      println("Setting up Spark context at: " + masterURL)
      val sparkConf = new SparkConf
      val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)
      println("Performing SQL tests")
      val sqlContext = new SQLContext(sc)
      println("SQL context set up")
      val df = sqlContext.read.json("/tmp/people.json")
      df.show()
      println("Selecting everyone's age and adding one to it")
      df.select(df("name"), df("age") + 1).show()
      println("Showing all people over the age of 21")
      df.filter(df("age") > 21).show()
      println("Counting people by age")
      df.groupBy("age").count().show()
    }
  }

Instead of getting:

  +----+-----+
  | age|count|
  +----+-----+
  |null|    1|
  |  19|    1|
  |  30|    1|
  +----+-----+

I get the following with my mixed endian set up:

  +-------------------+-----------------+
  |                age|            count|
  +-------------------+-----------------+
  |               null|                1|
  |1369094286720630784|72057594037927936|
  |                 30|                1|
  +-------------------+-----------------+

and on another run:

  +---+-----------------+
  |age|            count|
  +---+-----------------+
  |  0|72057594037927936|
  | 19|                1|

Is Spark expected to work in such an environment? If I turn off tungsten (sparkConf.set("spark.sql.tungsten.enabled", "false")), in 20 runs I don't see any problems.
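The corrupted values are consistent with 8-byte values written with one byte order and read back with the other: 19 is 0x13, and 0x13 landing in the most significant byte gives 19 * 2^56 = 1369094286720630784, while a count of 1 becomes 2^56 = 72057594037927936. A quick sketch reproducing the arithmetic (illustration only, not Spark code):

  import java.nio.ByteBuffer;
  import java.nio.ByteOrder;

  // Write longs little-endian, then re-read the same bytes big-endian:
  // this reproduces the corrupted values seen in the mixed endian runs above.
  public class EndianDemo {
      public static void main(String[] args) {
          ByteBuffer buf = ByteBuffer.allocate(8);
          buf.order(ByteOrder.LITTLE_ENDIAN).putLong(0, 19L);
          System.out.println(buf.order(ByteOrder.BIG_ENDIAN).getLong(0)); // 1369094286720630784
          buf.order(ByteOrder.LITTLE_ENDIAN).putLong(0, 1L);
          System.out.println(buf.order(ByteOrder.BIG_ENDIAN).getLong(0)); // 72057594037927936
      }
  }

This points at Tungsten's raw memory format (written by the little endian workers, interpreted by the big endian master) rather than at the SQL logic itself, which also fits the corruption disappearing once tungsten is disabled.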
Re: Test workflow - blacklist entire suites and run any independently
Thanks Josh, I should have added that we've tried -DwildcardSuites with Maven and we use this helpful feature regularly (although it does result in building plenty of tests and running other tests in other modules too), so I'm wondering if there's a more "streamlined" way - e.g. with JUnit and Eclipse we'd just right-click one individual unit test and that would be run, without building again, AFAIK.

Unfortunately using sbt causes a lot of pain, such as:

  [error]
  [error] last tree to typer: Literal(Constant(org.apache.spark.sql.test.ExamplePoint))
  [error]            symbol: null
  [error] symbol definition: null
  [error]               tpe: Class(classOf[org.apache.spark.sql.test.ExamplePoint])
  [error]     symbol owners:
  [error]    context owners: class ExamplePointUDT -> package test
  [error]

and then an awfully long stacktrace with plenty of errors. Must be an easier way...

From: Josh Rosen
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev
Date: 21/09/2015 19:19
Subject: Re: Test workflow - blacklist entire suites and run any independently

For quickly running individual suites: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests

On Mon, Sep 21, 2015 at 8:21 AM, Adam Roberts wrote:

Hi, is there an existing way to blacklist any test suite? Ideally we'd have a text file with a series of names (let's say comma-separated), and if a name matches the fully qualified class name for a suite, that suite will be skipped. Perhaps we can achieve this via ScalaTest or Maven?

Currently, if a number of suites are failing, we're required to comment these out, commit and push this change, then kick off a Jenkins job (perhaps building a custom branch) - not ideal when working with Jenkins; it would be quicker to use a mechanism as described above, as opposed to having a few branches that are a little different from others.

Also, how can we quickly run just one suite within, say, sql/hive? -f sql/hive/pom.xml with -nsu results in compile failures each time.
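For reference, the invocation style that wiki page describes is along these lines (the module and suite name here are illustrative, and the exact flags may vary by Spark version):

  mvn test -pl sql/hive -DwildcardSuites=org.apache.spark.sql.hive.ExampleSuite -Dtest=none

where -pl restricts the build to one module, -DwildcardSuites selects the ScalaTest suite(s) by wildcard class name, and -Dtest=none skips the surefire-run Java tests.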
Test workflow - blacklist entire suites and run any independently
Hi, is there an existing way to blacklist any test suite? Ideally we'd have a text file with a series of names (let's say comma-separated), and if a name matches the fully qualified class name for a suite, that suite will be skipped. Perhaps we can achieve this via ScalaTest or Maven?

Currently, if a number of suites are failing, we're required to comment these out, commit and push this change, then kick off a Jenkins job (perhaps building a custom branch) - not ideal when working with Jenkins; it would be quicker to use a mechanism as described above, as opposed to having a few branches that are a little different from others.

Also, how can we quickly run just one suite within, say, sql/hive? -f sql/hive/pom.xml with -nsu results in compile failures each time.