Re: withExpr private method duplication in Column and functions objects?

2016-11-11 Thread Reynold Xin
private[sql] has no impact in Java, and these functions are literally one
line of code. It's overkill to think about code duplication for functions
that simple.
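
To illustrate the first point, here is a sketch (ColumnUtils is
hypothetical, not actual Spark code) of what a shared helper would look
like:

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical shared home for the one-liner that Column.scala and
// functions.scala each define privately today.
private[sql] object ColumnUtils {
  def withExpr(newExpr: Expression): Column = new Column(newExpr)
}

Because private[sql] compiles down to a public member in bytecode, the
helper would still be visible to Java callers either way.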



On Fri, Nov 11, 2016 at 1:12 PM, Jacek Laskowski  wrote:

> Hi,
>
> Any reason for the withExpr duplication in the Column [1] and functions [2]
> objects? It looks like it could be less private, at least private[sql]?
>
> private def withExpr(newExpr: Expression): Column = new Column(newExpr)
>
> [1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L152
> [2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L60
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


withExpr private method duplication in Column and functions objects?

2016-11-11 Thread Jacek Laskowski
Hi,

Any reason for the withExpr duplication in the Column [1] and functions [2]
objects? It looks like it could be less private, at least private[sql]?

private def withExpr(newExpr: Expression): Column = new Column(newExpr)
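
For context, both objects use the helper the same way, to wrap a Catalyst
Expression back into a Column, roughly like this (illustrative, quoted from
memory rather than verbatim):

// In Column.scala, operators wrap their resulting expression:
def === (other: Any): Column = withExpr { EqualTo(expr, lit(other).expr) }

// In functions.scala, standalone functions do the same:
def abs(e: Column): Column = withExpr { Abs(e.expr) }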

[1] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L152
[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L60

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Reynold Xin
The vote has passed with the following +1s and no -1. I will work on
packaging the release.

+1:

Reynold Xin*
Herman van Hövell tot Westerflier
Ricardo Almeida
Shixiong (Ryan) Zhu
Sean Owen*
Michael Armbrust*
Dongjoon Hyun
Jagadeesan As
Liwei Lin
Weiqing Yang
Vaquar Khan
Denny Lee
Yin Huai*
Ryan Blue
Pratik Sharma
Kousuke Saruta
Tathagata Das*
Mingjie Tang
Adam Roberts

* = binding


On Mon, Nov 7, 2016 at 10:09 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
>
>
> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b143367ba694b0c34)
>
> This release candidate resolves 84 issues: https://s.apache.org/spark-2.0.2-jira
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>
>
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
>
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
>
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>


spark sql query of nested json lists data

2016-11-11 Thread robert
I am new to Spark SQL development. I have a JSON file with nested arrays.
I can extract/query these arrays. However, when I add an ORDER BY clause, I
get exceptions. Here are the steps:
1) val a = sparkSession.sql("SELECT Tables.TableName, Tables.TableType,
Tables.TableExecOrder, Tables.Columns FROM tblConfig LATERAL VIEW
explode(TargetTable.Tables[0]) s AS Tables")
a.show(5)
output:
+---------+---------+--------------+--------------------+
|TableName|TableType|TableExecOrder|             Columns|
+---------+---------+--------------+--------------------+
|      TB0|    Final|             0|[[name,INT], [nam...|
|      TB1|     temp|             2|[[name,INT], [nam...|
|      TB2|     temp|             1|[[name,INT], [nam...|
+---------+---------+--------------+--------------------+

2) a.createOrReplaceTempView("aa")
sparkSession.sql("SELECT TableName, TableExecOrder, Columns FROM aa
order by TableExecOrder").show(5)

Output: exception

16/11/11 11:17:00 ERROR TaskResultGetter: Exception while getting task
result
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
underlying (org.apache.spark.util.BoundedPriorityQueue)
at
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
at
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
at 
org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
at
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
at java.util.PriorityQueue.offer(PriorityQueue.java:344)
at java.util.PriorityQueue.add(PriorityQueue.java:321)
at
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
at
com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 15 more


How can I fix this?
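
One check I can think of, since the trace points at Kryo deserializing the
BoundedPriorityQueue behind the ordered take: run the same query with the
default Java serializer to see whether the NPE is Kryo-specific. A minimal
sketch (diagnostic only, not a fix):

import org.apache.spark.sql.SparkSession

// If the NPE disappears here, the problem is in Kryo's handling of the
// generated ordering rather than in the query itself.
val sparkSession = SparkSession.builder()
  .appName("orderby-npe-check")
  .config("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .getOrCreate()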



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-sql-query-of-nested-json-lists-data-tp19828.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Dongjoon Hyun
Hi.

Now, do we have Apache Spark 2.0.2? :)

Bests,
Dongjoon.

On 2016-11-07 22:09 (-0800), Reynold Xin  wrote: 
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
> a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.0.2
> [ ] -1 Do not release this package because ...
> 
> 
> The tag to be voted on is v2.0.2-rc3
> (584354eaac02531c9584188b143367ba694b0c34)
> 
> This release candidate resolves 84 issues:
> https://s.apache.org/spark-2.0.2-jira
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1214/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
> 
> 
> Q: How can I help test this release?
> A: If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions from 2.0.1.
> 
> Q: What justifies a -1 vote for this release?
> A: This is a maintenance release in the 2.0.x series. Bugs already present
> in 2.0.1, missing features, or bugs related to new features will not
> necessarily block this release.
> 
> Q: What fix version should I use for patches merging into branch-2.0 from
> now on?
> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Reduce the memory usage if we do a sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-11 Thread WangJianfei
When we train the model we use the data with a subsamplingRate, so if the
subsamplingRate < 1.0 we can do a sample first to reduce the memory usage.
See the code below from GradientBoostedTrees.boost():

 while (m < numIterations && !doneLearning) {
  // Update data with pseudo-residuals (residual errors)
  val data = predError.zip(input).map { case ((pred, _), point) =>
LabeledPoint(-loss.gradient(pred, point.label), point.features)
  }

  timer.start(s"building tree $m")
  logDebug("###")
  logDebug("Gradient boosting tree iteration " + m)
  logDebug("###")
  val dt = new DecisionTreeRegressor().setSeed(seed + m)
  val model = dt.train(data, treeStrategy)
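
i.e., something roughly like this before the boosting loop (a sketch of the
idea only, not the actual MLlib code, and it assumes sampling once up-front
rather than re-sampling every iteration is acceptable):

// Hypothetical: sample once when subsamplingRate < 1.0, so each
// per-iteration pseudo-residual RDD is built from the smaller dataset.
val trainingData =
  if (treeStrategy.subsamplingRate < 1.0) {
    input.sample(withReplacement = false, treeStrategy.subsamplingRate, seed)
  } else {
    input
  }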
   




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-in-GradientBoostedTrees-if-subsamplingRate-1-0-tp19826.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Adam Roberts
+1 (non-binding)

Build: mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
Test: mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test
Test options: -Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

No problems with OpenJDK 8 on x86.

No problems with the latest IBM Java 8, various architectures including 
big and little-endian, various operating systems including RHEL 7.2, CentOS 
7.2, Ubuntu 14.04, Ubuntu 16.04, SUSE 12.

No issues with the Python tests.

No performance concerns with HiBench large.




From:   Mingjie Tang 
To: Tathagata Das 
Cc:     Kousuke Saruta , Reynold Xin , dev
Date:   11/11/2016 03:44
Subject: Re: [VOTE] Release Apache Spark 2.0.2 (RC3)



+1 (non-binding)

On Thu, Nov 10, 2016 at 6:06 PM, Tathagata Das <
tathagata.das1...@gmail.com> wrote:
+1 binding

On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta  wrote:
+1 (non-binding)


On 2016-11-08 15:09, Reynold Xin wrote:
Please vote on releasing the following candidate as Apache Spark version 
2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if 
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc3 
(584354eaac02531c9584188b143367ba694b0c34)

This release candidate resolves 84 issues: 
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present 
in 2.0.1, missing features, or bugs related to new features will not 
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from 
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC 
(i.e. RC4) is cut, I will change the fix version of those patches to 
2.0.2.


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU