Re: withExpr private method duplication in Column and functions objects?
private[sql] has no impact in Java, and these functions are literally one line of code. It's overkill to think about code duplication for functions that simple.

On Fri, Nov 11, 2016 at 1:12 PM, Jacek Laskowski wrote:
> Hi,
>
> Any reason for withExpr duplication in Column [1] and functions [2]
> objects? It looks like it could be less private and be at least
> private[sql]?
>
> private def withExpr(newExpr: Expression): Column = new Column(newExpr)
>
> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L152
> [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L60
>
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
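For reference, a minimal self-contained sketch of the helper being discussed. The Expression and Column types below are simplified stand-ins for the real Catalyst classes, used here only to show why the body is a single constructor call:

```scala
// Simplified stand-ins for org.apache.spark.sql.catalyst.expressions.Expression
// and org.apache.spark.sql.Column -- not the actual Spark definitions.
trait Expression
class Column(val expr: Expression)

object FunctionsSketch {
  // The duplicated one-liner: wrap a Catalyst Expression in a user-facing
  // Column. In the real code both Column and the functions object keep a
  // private copy of this; since the body is a single constructor call,
  // sharing it would only widen visibility for no real saving.
  def withExpr(newExpr: Expression): Column = new Column(newExpr)
}
```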
withExpr private method duplication in Column and functions objects?
Hi,

Any reason for withExpr duplication in Column [1] and functions [2] objects? It looks like it could be less private and be at least private[sql]?

private def withExpr(newExpr: Expression): Column = new Column(newExpr)

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L152
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L60

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
Re: [VOTE] Release Apache Spark 2.0.2 (RC3)
The vote has passed with the following +1s and no -1. I will work on packaging the release.

+1:
Reynold Xin*
Herman van Hövell tot Westerflier
Ricardo Almeida
Shixiong (Ryan) Zhu
Sean Owen*
Michael Armbrust*
Dongjoon Hyun
Jagadeesan As
Liwei Lin
Weiqing Yang
Vaquar Khan
Denny Lee
Yin Huai*
Ryan Blue
Pratik Sharma
Kousuke Saruta
Tathagata Das*
Mingjie Tang
Adam Roberts

* = binding

On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>
> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
spark sql query of nested json lists data
I am new to Spark SQL development. I have a JSON file with nested arrays. I can extract/query these arrays; however, when I add an ORDER BY clause, I get exceptions. Here are the steps:

1)

val a = sparkSession.sql("SELECT Tables.TableName, Tables.TableType, Tables.TableExecOrder, Tables.Columns FROM tblConfig LATERAL VIEW explode(TargetTable.Tables[0]) s AS Tables")
a.show(5)

Output:

+---------+---------+--------------+--------------------+
|TableName|TableType|TableExecOrder|             Columns|
+---------+---------+--------------+--------------------+
|      TB0|    Final|             0|[[name,INT], [nam...|
|      TB1|     temp|             2|[[name,INT], [nam...|
|      TB2|     temp|             1|[[name,INT], [nam...|
+---------+---------+--------------+--------------------+

2)

a.createOrReplaceTempView("aa")
sparkSession.sql("SELECT TableName, TableExecOrder, Columns FROM aa ORDER BY TableExecOrder").show(5)

Output: exception

16/11/11 11:17:00 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (org.apache.spark.util.BoundedPriorityQueue)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
    at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
    at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
    at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
    at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
    at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
    at java.util.PriorityQueue.offer(PriorityQueue.java:344)
    at java.util.PriorityQueue.add(PriorityQueue.java:321)
    at com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
    at com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
    ... 15 more

How can I fix this?
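One hedged workaround sketch (an assumption based on reading the trace, not a confirmed fix): the failure occurs while Kryo deserializes a BoundedPriorityQueue, which Spark uses for the ordered-take path behind show() on a sorted query. Materializing the full sort first may sidestep that path. This assumes the same `sparkSession` and `aa` view as in the question:

```scala
// Untested sketch: force a full sort and materialize it, so show() does not
// go through the takeOrdered/BoundedPriorityQueue path that appears in the
// stack trace. `sparkSession` and the `aa` view are assumed from above.
val sorted = sparkSession.sql(
  "SELECT TableName, TableExecOrder, Columns FROM aa ORDER BY TableExecOrder")
sorted.cache()
sorted.count() // materialize the sorted result
sorted.show(5)
```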
Re: [VOTE] Release Apache Spark 2.0.2 (RC3)
Hi.

Now, do we have Apache Spark 2.0.2? :)

Bests,
Dongjoon.

On 2016-11-07 22:09 (-0800), Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
> [...]
Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0
When we train the model, we use the data with a subsample rate, so if subsamplingRate < 1.0 we can do a sample first to reduce the memory usage. See the code below in GradientBoostedTrees.boost():

while (m < numIterations && !doneLearning) {
  // Update data with pseudo-residuals (residual errors)
  val data = predError.zip(input).map { case ((pred, _), point) =>
    LabeledPoint(-loss.gradient(pred, point.label), point.features)
  }

  timer.start(s"building tree $m")
  logDebug("###")
  logDebug("Gradient boosting tree iteration " + m)
  logDebug("###")
  val dt = new DecisionTreeRegressor().setSeed(seed + m)
  val model = dt.train(data, treeStrategy)
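The suggestion could be sketched as the following hypothetical change. This is an illustration of the idea, not an actual patch: `input`, `predError`, `loss`, `seed`, `m`, and `treeStrategy` are assumed from the surrounding boost() loop, and whether `treeStrategy` exposes the rate this way is an assumption:

```scala
// Hypothetical sketch: subsample the (error, point) pairs before building
// each tree, so per-iteration memory scales with subsamplingRate rather
// than the full dataset. All names are assumed from boost() above.
val subsamplingRate = treeStrategy.subsamplingRate // assumed accessor
val pairs = predError.zip(input)
val sampled =
  if (subsamplingRate < 1.0)
    pairs.sample(withReplacement = false, fraction = subsamplingRate, seed = seed + m)
  else pairs
val data = sampled.map { case ((pred, _), point) =>
  LabeledPoint(-loss.gradient(pred, point.label), point.features)
}
```

A caveat worth noting: the tree trainer may already apply its own subsampling internally, so sampling here would have to replace, not stack with, that behavior.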
Re: [VOTE] Release Apache Spark 2.0.2 (RC3)
+1 (non-binding)

Build: mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
Test: mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test
Test options: -Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

No problems with OpenJDK 8 on x86. No problems with the latest IBM Java 8 on various architectures, including big- and little-endian, and various operating systems including RHEL 7.2, CentOS 7.2, Ubuntu 14.04, Ubuntu 16.04, and SUSE 12. No issues with the Python tests. No performance concerns with HiBench large.

From: Mingjie Tang
To: Tathagata Das
Cc: Kousuke Saruta, Reynold Xin, dev
Date: 11/11/2016 03:44
Subject: Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

+1 (non-binding)

On Thu, Nov 10, 2016 at 6:06 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

+1 binding

On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta wrote:

+1 (non-binding)

On November 8, 2016 at 15:09, Reynold Xin wrote:

Please vote on releasing the following candidate as Apache Spark version 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [...]

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU