spark git commit: [SPARK-13518][SQL] Enable vectorized parquet scanner by default
Repository: spark Updated Branches: refs/heads/master 59e3e10be -> 7a0cb4e58 [SPARK-13518][SQL] Enable vectorized parquet scanner by default ## What changes were proposed in this pull request? Change the default of the flag to enable this feature now that the implementation is complete. ## How was this patch tested? The new parquet reader should be a drop in, so will be exercised by the existing tests. Author: Nong Li Closes #11397 from nongli/spark-13518. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7a0cb4e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7a0cb4e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7a0cb4e5 Branch: refs/heads/master Commit: 7a0cb4e58728834b49050ce4fae418acc18a601f Parents: 59e3e10 Author: Nong Li Authored: Fri Feb 26 22:36:32 2016 -0800 Committer: Reynold Xin Committed: Fri Feb 26 22:36:32 2016 -0800 -- .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7a0cb4e5/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 9a50ef7..1d1e288 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -345,12 +345,9 @@ object SQLConf { defaultValue = Some(true), doc = "Enables using the custom ParquetUnsafeRowRecordReader.") - // Note: this can not be enabled all the time because the reader will not be returning UnsafeRows. - // Doing so is very expensive and we should remove this requirement instead of fixing it here. - // Initial testing seems to indicate only sort requires this. val PARQUET_VECTORIZED_READER_ENABLED = booleanConf( key = "spark.sql.parquet.enableVectorizedReader", -defaultValue = Some(false), +defaultValue = Some(true), doc = "Enables vectorized parquet decoding.") val ORC_FILTER_PUSHDOWN_ENABLED = booleanConf("spark.sql.orc.filterPushdown", - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
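For readers who want to verify the new default or temporarily fall back to the previous behavior, the flag can be inspected and toggled per session. A minimal Scala sketch (not part of the patch), assuming an existing `SQLContext` named `sqlContext` and a hypothetical Parquet path:

```
// The flag defaults to "true" after this commit.
println(sqlContext.getConf("spark.sql.parquet.enableVectorizedReader"))

// Fall back to the non-vectorized reader for this session if needed.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")

// Reads after this point use the row-at-a-time Parquet record reader.
val df = sqlContext.read.parquet("/tmp/example.parquet")  // hypothetical path
println(df.count())
```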
spark git commit: [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts
Repository: spark Updated Branches: refs/heads/master f77dc4e1e -> 59e3e10be [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts ## What changes were proposed in this pull request? We provide a very limited set of cluster management script in Spark for Tachyon, although Tachyon itself provides a much better version of it. Given now Spark users can simply use Tachyon as a normal file system and does not require extensive configurations, we can remove this management capabilities to simplify Spark bash scripts. Note that this also reduces coupling between a 3rd party external system and Spark's release scripts, and would eliminate possibility for failures such as Tachyon being renamed or the tar balls being relocated. ## How was this patch tested? N/A Author: Reynold Xin Closes #11400 from rxin/release-script. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/59e3e10b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/59e3e10b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/59e3e10b Branch: refs/heads/master Commit: 59e3e10be2f9a1c53979ca72c038adb4fa17ca64 Parents: f77dc4e Author: Reynold Xin Authored: Fri Feb 26 22:35:12 2016 -0800 Committer: Reynold Xin Committed: Fri Feb 26 22:35:12 2016 -0800 -- docs/configuration.md | 24 docs/job-scheduling.md| 3 +-- docs/programming-guide.md | 22 ++- make-distribution.sh | 50 +- sbin/start-all.sh | 15 ++--- sbin/start-master.sh | 21 -- sbin/start-slaves.sh | 22 --- sbin/stop-master.sh | 4 sbin/stop-slaves.sh | 5 - 9 files changed, 6 insertions(+), 160 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/59e3e10b/docs/configuration.md -- diff --git a/docs/configuration.md b/docs/configuration.md index 4b1b007..e9b6623 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -929,30 +929,6 @@ Apart from these, the following properties are also available, and may be useful mapping has high overhead for blocks close to or below the page size of the operating system. - - spark.externalBlockStore.blockManager - org.apache.spark.storage.TachyonBlockManager - -Implementation of external block manager (file system) that store RDDs. The file system's URL is set by -spark.externalBlockStore.url. - - - - spark.externalBlockStore.baseDir - System.getProperty("java.io.tmpdir") - -Directories of the external block store that store RDDs. The file system's URL is set by - spark.externalBlockStore.url It can also be a comma-separated list of multiple -directories on Tachyon file system. - - - - spark.externalBlockStore.url - tachyon://localhost:19998 for Tachyon - -The URL of the underlying external blocker file system in the external block store. - - Networking http://git-wip-us.apache.org/repos/asf/spark/blob/59e3e10b/docs/job-scheduling.md -- diff --git a/docs/job-scheduling.md b/docs/job-scheduling.md index 95d4779..00b6a18 100644 --- a/docs/job-scheduling.md +++ b/docs/job-scheduling.md @@ -54,8 +54,7 @@ an application to gain back cores on one node when it has work to do. To use thi Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying -the same RDDs. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will -provide another approach to share RDDs. +the same RDDs. 
## Dynamic Resource Allocation http://git-wip-us.apache.org/repos/asf/spark/blob/59e3e10b/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 5ebafa4..2f0ed5e 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -1177,7 +1177,7 @@ that originally created it. In addition, each persisted RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), -replicate it across nodes, or store it off-heap in [Tachyon](http://tachyon-project.org/). +replicate it across nodes. These levels are set by passing a `StorageLevel` object ([Scala](api/scala/index.html#org.apache.spark.storage.StorageLevel), [Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html), @@ -1218,24 +1218,11 @@
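The programming-guide edit above only drops the off-heap/Tachyon wording; persisting RDDs by passing a `StorageLevel` is unchanged. A minimal sketch of the remaining usage (illustrative, not from the patch), assuming an existing `SparkContext` named `sc`:

```
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// Persist in memory and on disk as serialized Java objects, replicated to two nodes.
// These levels still work; only the external (Tachyon) block store wiring was removed.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

println(rdd.count())
```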
[2/2] spark git commit: Preparing development version 1.6.2-SNAPSHOT
Preparing development version 1.6.2-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dcf60d79 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dcf60d79 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dcf60d79 Branch: refs/heads/branch-1.6 Commit: dcf60d79e2c46b7bafb67020e89623b060ab462b Parents: 15de51c Author: Patrick Wendell Authored: Fri Feb 26 20:09:11 2016 -0800 Committer: Patrick Wendell Committed: Fri Feb 26 20:09:11 2016 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 87559fd..ff34522 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 1050217..e31cf2f 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 1399103..4f909ba 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 57e64df..a136781 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index a713135..ed97dc2 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index cbc64bd..0d6b563 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/external/flume-sink/pom.xml -- diff --git 
a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml index 26a1456
[1/2] spark git commit: Preparing Spark release v1.6.1-rc1
Repository: spark Updated Branches: refs/heads/branch-1.6 eb6f6da48 -> dcf60d79e Preparing Spark release v1.6.1-rc1 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/15de51c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/15de51c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/15de51c2 Branch: refs/heads/branch-1.6 Commit: 15de51c238a7340fa81cb0b80d029a05d97bfc5c Parents: eb6f6da Author: Patrick Wendell Authored: Fri Feb 26 20:09:04 2016 -0800 Committer: Patrick Wendell Committed: Fri Feb 26 20:09:04 2016 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 2087e43..87559fd 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 4b4fc24..1050217 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index e1b2c1c..1399103 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 14b6eac..57e64df 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 90cf2b8..a713135 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index b78f3d4..cbc64bd 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/external/flume-sink/pom.xml -- diff --gi
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.1-rc1 [created] 15de51c23

spark git commit: Update CHANGES.txt and spark-ec2 and R package versions for 1.6.1
Repository: spark Updated Branches: refs/heads/branch-1.6 8a43c3bfb -> eb6f6da48 Update CHANGES.txt and spark-ec2 and R package versions for 1.6.1 This patch updates a few more 1.6.0 version numbers for the 1.6.1 release candidate. Verified this by running ``` git grep "1\.6\.0" | grep -v since | grep -v deprecated | grep -v Since | grep -v versionadded | grep 1.6.0 ``` and inspecting the output. Author: Josh Rosen Closes #11407 from JoshRosen/version-number-updates. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb6f6da4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb6f6da4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb6f6da4 Branch: refs/heads/branch-1.6 Commit: eb6f6da484b4390c5b196d8426a49609b6a6fc7c Parents: 8a43c3b Author: Josh Rosen Authored: Fri Feb 26 20:05:44 2016 -0800 Committer: Josh Rosen Committed: Fri Feb 26 20:05:44 2016 -0800 -- CHANGES.txt | 788 + R/pkg/DESCRIPTION | 2 +- dev/create-release/generate-changelist.py | 4 +- ec2/spark_ec2.py | 4 +- 4 files changed, 794 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eb6f6da4/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index ff59371..f66bef9 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,6 +1,794 @@ Spark Change Log +Release 1.6.1 + + [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org + Josh Rosen + 2016-02-26 18:40:00 -0800 + Commit: 8a43c3b, github.com/apache/spark/pull/11350 + + [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore. + Yin Huai + 2016-02-26 12:34:03 -0800 + Commit: a57f87e, github.com/apache/spark/pull/11349 + + [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication + Yu ISHIKAWA + 2016-02-25 13:21:33 -0800 + Commit: abe8f99, github.com/apache/spark/pull/11370 + + Revert "[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames" + Xiangrui Meng + 2016-02-25 12:28:03 -0800 + Commit: d59a08f + + [SPARK-12316] Wait a minutes to avoid cycle calling. + huangzhaowei + 2016-02-25 09:14:19 -0600 + Commit: 5f7440b2, github.com/apache/spark/pull/10475 + + [SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated + Michael Gummelt + 2016-02-25 13:32:09 + + Commit: e3802a7, github.com/apache/spark/pull/11311 + + [SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method + Terence Yim + 2016-02-25 13:29:30 + + Commit: 1f03163, github.com/apache/spark/pull/11337 + + [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames + Oliver Pierson , Oliver Pierson + 2016-02-25 13:24:46 + + Commit: cb869a1, github.com/apache/spark/pull/11319 + + [SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) + Cheng Lian + 2016-02-25 20:43:03 +0800 + Commit: 3cc938a, github.com/apache/spark/pull/11348 + + [SPARK-13482][MINOR][CONFIGURATION] Make consistency of the configuraiton named in TransportConf. 
+ huangzhaowei + 2016-02-24 23:52:17 -0800 + Commit: 8975996, github.com/apache/spark/pull/11360 + + [SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in PR builder even if a PR only changes sql/core + Yin Huai + 2016-02-24 13:34:53 -0800 + Commit: fe71cab, github.com/apache/spark/pull/11351 + + [SPARK-13390][SQL][BRANCH-1.6] Fix the issue that Iterator.map().toSeq is not Serializable + Shixiong Zhu + 2016-02-24 13:35:36 + + Commit: 06f4fce, github.com/apache/spark/pull/11334 + + [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. + Franklyn D'souza + 2016-02-23 15:34:04 -0800 + Commit: 573a2c9, github.com/apache/spark/pull/11333 + + [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply + Xiangrui Meng + 2016-02-22 23:54:21 -0800 + Commit: 0784e02, github.com/apache/spark/pull/11226 + + [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6 + Earthson Lu + 2016-02-22 23:40:36 -0800 + Commit: d31854d, github.com/apache/spark/pull/11237 + + Preparing development version 1.6.1-SNAPSHOT + Patrick Wendell + 2016-02-22 18:30:30 -0800 + Commit: 2902798 + + Preparing Spark release v1.6.1-rc1 + Patrick Wendell + 2016-02-22 18:30:24 -0800 + Commit: 152252f + + Update branch-1.6 for 1.6.1 release + Michael Armbrust + 2016-02-22 18:25:48 -0800 + Commit: 40d11d0 + + [SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec + Daoyuan Wang + 2016-02-22 18:13:32 -080
spark git commit: [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org
Repository: spark Updated Branches: refs/heads/master ad615291f -> f77dc4e1e [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync. Author: Josh Rosen Closes #11350 from JoshRosen/update-release-scripts-for-apache-home. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f77dc4e1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f77dc4e1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f77dc4e1 Branch: refs/heads/master Commit: f77dc4e1e202942aa8393fb5d8f492863973fe17 Parents: ad61529 Author: Josh Rosen Authored: Fri Feb 26 18:40:00 2016 -0800 Committer: Josh Rosen Committed: Fri Feb 26 18:40:00 2016 -0800 -- dev/create-release/release-build.sh | 60 +++- 1 file changed, 44 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f77dc4e1/dev/create-release/release-build.sh -- diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh index 2fd7fcc..c08b6d7 100755 --- a/dev/create-release/release-build.sh +++ b/dev/create-release/release-build.sh @@ -23,8 +23,8 @@ usage: release-build.sh Creates build deliverables from a Spark commit. Top level targets are - package: Create binary packages and copy them to people.apache - docs: Build docs and copy them to people.apache + package: Create binary packages and copy them to home.apache + docs: Build docs and copy them to home.apache publish-snapshot: Publish snapshot release to Apache snapshots publish-release: Publish a release to Apache release repo @@ -64,13 +64,16 @@ for env in ASF_USERNAME ASF_RSA_KEY GPG_PASSPHRASE GPG_KEY; do fi done +# Explicitly set locale in order to make `sort` output consistent across machines. +# See https://stackoverflow.com/questions/28881 for more details. +export LC_ALL=C + # Commit ref to checkout when building GIT_REF=${GIT_REF:-master} # Destination directory parent on remote server REMOTE_PARENT_DIR=${REMOTE_PARENT_DIR:-/home/$ASF_USERNAME/public_html} -SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" GPG="gpg --no-tty --batch" NEXUS_ROOT=https://repository.apache.org/service/local/staging NEXUS_PROFILE=d63f592e7eac0 # Profile for Spark staging uploads @@ -97,7 +100,20 @@ if [ -z "$SPARK_PACKAGE_VERSION" ]; then fi DEST_DIR_NAME="spark-$SPARK_PACKAGE_VERSION" -USER_HOST="$asf_usern...@people.apache.org" + +function LFTP { + SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" + COMMANDS=$(cat
spark git commit: [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org
Repository: spark Updated Branches: refs/heads/branch-1.6 a57f87ee4 -> 8a43c3bfb [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync. Author: Josh Rosen Closes #11350 from JoshRosen/update-release-scripts-for-apache-home. (cherry picked from commit f77dc4e1e202942aa8393fb5d8f492863973fe17) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a43c3bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a43c3bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a43c3bf Branch: refs/heads/branch-1.6 Commit: 8a43c3bfbcd9d6e3876e09363dba604dc7e63dc3 Parents: a57f87e Author: Josh Rosen Authored: Fri Feb 26 18:40:00 2016 -0800 Committer: Josh Rosen Committed: Fri Feb 26 18:40:23 2016 -0800 -- dev/create-release/release-build.sh | 60 +++- 1 file changed, 44 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8a43c3bf/dev/create-release/release-build.sh -- diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh index cb79e9e..2c3af6a 100755 --- a/dev/create-release/release-build.sh +++ b/dev/create-release/release-build.sh @@ -23,8 +23,8 @@ usage: release-build.sh Creates build deliverables from a Spark commit. Top level targets are - package: Create binary packages and copy them to people.apache - docs: Build docs and copy them to people.apache + package: Create binary packages and copy them to home.apache + docs: Build docs and copy them to home.apache publish-snapshot: Publish snapshot release to Apache snapshots publish-release: Publish a release to Apache release repo @@ -64,13 +64,16 @@ for env in ASF_USERNAME ASF_RSA_KEY GPG_PASSPHRASE GPG_KEY; do fi done +# Explicitly set locale in order to make `sort` output consistent across machines. +# See https://stackoverflow.com/questions/28881 for more details. +export LC_ALL=C + # Commit ref to checkout when building GIT_REF=${GIT_REF:-master} # Destination directory parent on remote server REMOTE_PARENT_DIR=${REMOTE_PARENT_DIR:-/home/$ASF_USERNAME/public_html} -SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" GPG="gpg --no-tty --batch" NEXUS_ROOT=https://repository.apache.org/service/local/staging NEXUS_PROFILE=d63f592e7eac0 # Profile for Spark staging uploads @@ -97,7 +100,20 @@ if [ -z "$SPARK_PACKAGE_VERSION" ]; then fi DEST_DIR_NAME="spark-$SPARK_PACKAGE_VERSION" -USER_HOST="$asf_usern...@people.apache.org" + +function LFTP { + SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" + COMMANDS=$(cat
spark git commit: [SPARK-13519][CORE] Driver should tell Executor to stop itself when cleaning executor's state
Repository: spark Updated Branches: refs/heads/master 1e5fcdf96 -> ad615291f [SPARK-13519][CORE] Driver should tell Executor to stop itself when cleaning executor's state ## What changes were proposed in this pull request? When the driver removes an executor's state, the connection between the driver and the executor may be still alive so that the executor cannot exit automatically (E.g., Master will send RemoveExecutor when a work is lost but the executor is still alive), so the driver should try to tell the executor to stop itself. Otherwise, we will leak an executor. This PR modified the driver to send `StopExecutor` to the executor when it's removed. ## How was this patch tested? manual test: increase the worker heartbeat interval to force it's always timeout and the leak executors are gone. Author: Shixiong Zhu Closes #11399 from zsxwing/SPARK-13519. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad615291 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad615291 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad615291 Branch: refs/heads/master Commit: ad615291fe76580ee59e3f48f4efe4627a01409d Parents: 1e5fcdf Author: Shixiong Zhu Authored: Fri Feb 26 15:11:57 2016 -0800 Committer: Andrew Or Committed: Fri Feb 26 15:11:57 2016 -0800 -- .../spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala | 4 1 file changed, 4 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad615291/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala index 0a5b09d..d151de5 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala @@ -179,6 +179,10 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp context.reply(true) case RemoveExecutor(executorId, reason) => +// We will remove the executor's state and cannot restore it. However, the connection +// between the driver and the executor may be still alive so that the executor won't exit +// automatically, so try to tell the executor to stop itself. See SPARK-13519. + executorDataMap.get(executorId).foreach(_.executorEndpoint.send(StopExecutor)) removeExecutor(executorId, reason) context.reply(true) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-13505][ML] add python api for MaxAbsScaler
Repository: spark Updated Branches: refs/heads/master 391755dc6 -> 1e5fcdf96 [SPARK-13505][ML] add python api for MaxAbsScaler ## What changes were proposed in this pull request? After SPARK-13028, we should add Python API for MaxAbsScaler. ## How was this patch tested? unit test Author: zlpmichelle Closes #11393 from zlpmichelle/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e5fcdf9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e5fcdf9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e5fcdf9 Branch: refs/heads/master Commit: 1e5fcdf96c0176a11e5f425ba539b6ed629281db Parents: 391755d Author: zlpmichelle Authored: Fri Feb 26 14:37:44 2016 -0800 Committer: Xiangrui Meng Committed: Fri Feb 26 14:37:44 2016 -0800 -- python/pyspark/ml/feature.py | 75 +++ 1 file changed, 68 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e5fcdf9/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 67bccfa..369f350 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -28,13 +28,14 @@ from pyspark.mllib.common import inherit_doc from pyspark.mllib.linalg import _convert_to_vector __all__ = ['Binarizer', 'Bucketizer', 'CountVectorizer', 'CountVectorizerModel', 'DCT', - 'ElementwiseProduct', 'HashingTF', 'IDF', 'IDFModel', 'IndexToString', 'MinMaxScaler', - 'MinMaxScalerModel', 'NGram', 'Normalizer', 'OneHotEncoder', 'PCA', 'PCAModel', - 'PolynomialExpansion', 'QuantileDiscretizer', 'RegexTokenizer', 'RFormula', - 'RFormulaModel', 'SQLTransformer', 'StandardScaler', 'StandardScalerModel', - 'StopWordsRemover', 'StringIndexer', 'StringIndexerModel', 'Tokenizer', - 'VectorAssembler', 'VectorIndexer', 'VectorSlicer', 'Word2Vec', 'Word2VecModel', - 'ChiSqSelector', 'ChiSqSelectorModel'] + 'ElementwiseProduct', 'HashingTF', 'IDF', 'IDFModel', 'IndexToString', + 'MaxAbsScaler', 'MaxAbsScalerModel', 'MinMaxScaler', 'MinMaxScalerModel', + 'NGram', 'Normalizer', 'OneHotEncoder', 'PCA', 'PCAModel', 'PolynomialExpansion', + 'QuantileDiscretizer', 'RegexTokenizer', 'RFormula', 'RFormulaModel', + 'SQLTransformer', 'StandardScaler', 'StandardScalerModel', 'StopWordsRemover', + 'StringIndexer', 'StringIndexerModel', 'Tokenizer', 'VectorAssembler', + 'VectorIndexer', 'VectorSlicer', 'Word2Vec', 'Word2VecModel', 'ChiSqSelector', + 'ChiSqSelectorModel'] @inherit_doc @@ -545,6 +546,66 @@ class IDFModel(JavaModel): @inherit_doc +class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol): +""" +.. note:: Experimental + +Rescale each feature individually to range [-1, 1] by dividing through the largest maximum +absolute value in each feature. It does not shift/center the data, and thus does not destroy +any sparsity. + +>>> from pyspark.mllib.linalg import Vectors +>>> df = sqlContext.createDataFrame([(Vectors.dense([1.0]),), (Vectors.dense([2.0]),)], ["a"]) +>>> maScaler = MaxAbsScaler(inputCol="a", outputCol="scaled") +>>> model = maScaler.fit(df) +>>> model.transform(df).show() ++-+--+ +|a|scaled| ++-+--+ +|[1.0]| [0.5]| +|[2.0]| [1.0]| ++-+--+ +... + +.. 
versionadded:: 2.0.0 +""" + +@keyword_only +def __init__(self, inputCol=None, outputCol=None): +""" +__init__(self, inputCol=None, outputCol=None) +""" +super(MaxAbsScaler, self).__init__() +self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.MaxAbsScaler", self.uid) +self._setDefault() +kwargs = self.__init__._input_kwargs +self.setParams(**kwargs) + +@keyword_only +@since("2.0.0") +def setParams(self, inputCol=None, outputCol=None): +""" +setParams(self, inputCol=None, outputCol=None) +Sets params for this MaxAbsScaler. +""" +kwargs = self.setParams._input_kwargs +return self._set(**kwargs) + +def _create_model(self, java_model): +return MaxAbsScalerModel(java_model) + + +class MaxAbsScalerModel(JavaModel): +""" +.. note:: Experimental + +Model fitted by :py:class:`MaxAbsScaler`. + +.. versionadded:: 2.0.0 +""" + + +@inherit_doc class MinMaxScaler(JavaEstimator, HasInputCol, HasOutputCol): """ .. note:: Experimental - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional
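The new Python class wraps the Scala `org.apache.spark.ml.feature.MaxAbsScaler` introduced by SPARK-13028. For comparison with the doctest above, a minimal Scala sketch of the same flow (illustrative, assuming an existing `SQLContext` named `sqlContext`):

```
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.mllib.linalg.Vectors

// Two single-feature rows, mirroring the Python doctest.
val df = sqlContext.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0)),
  Tuple1(Vectors.dense(2.0))
)).toDF("a")

val scaler = new MaxAbsScaler().setInputCol("a").setOutputCol("scaled")
val model = scaler.fit(df)

// Each feature is divided by its maximum absolute value: 1.0 -> 0.5, 2.0 -> 1.0.
model.transform(df).show()
```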
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.1-rc1 [deleted] 152252f15
spark git commit: [SPARK-13465] Add a task failure listener to TaskContext
Repository: spark Updated Branches: refs/heads/master 0598a2b81 -> 391755dc6 [SPARK-13465] Add a task failure listener to TaskContext ## What changes were proposed in this pull request? TaskContext supports task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails. ## How was the this patch tested? New unit test case and integration test case covering the code path Author: Reynold Xin Closes #11340 from rxin/SPARK-13465. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/391755dc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/391755dc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/391755dc Branch: refs/heads/master Commit: 391755dc6ed2e156b8df8a530ac8df6ed7ba7f8a Parents: 0598a2b Author: Reynold Xin Authored: Fri Feb 26 12:49:16 2016 -0800 Committer: Davies Liu Committed: Fri Feb 26 12:49:16 2016 -0800 -- .../scala/org/apache/spark/TaskContext.scala| 28 +++- .../org/apache/spark/TaskContextImpl.scala | 33 -- .../scala/org/apache/spark/scheduler/Task.scala | 5 ++ .../spark/util/TaskCompletionListener.scala | 33 -- .../util/TaskCompletionListenerException.scala | 34 -- .../org/apache/spark/util/taskListeners.scala | 68 .../spark/JavaTaskCompletionListenerImpl.java | 39 --- .../spark/JavaTaskContextCompileCheck.java | 30 + .../spark/scheduler/TaskContextSuite.scala | 44 - project/MimaExcludes.scala | 4 +- 10 files changed, 201 insertions(+), 117 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/391755dc/core/src/main/scala/org/apache/spark/TaskContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskContext.scala b/core/src/main/scala/org/apache/spark/TaskContext.scala index 9f49cf1..bfcacbf 100644 --- a/core/src/main/scala/org/apache/spark/TaskContext.scala +++ b/core/src/main/scala/org/apache/spark/TaskContext.scala @@ -23,7 +23,7 @@ import org.apache.spark.annotation.DeveloperApi import org.apache.spark.executor.TaskMetrics import org.apache.spark.memory.TaskMemoryManager import org.apache.spark.metrics.source.Source -import org.apache.spark.util.TaskCompletionListener +import org.apache.spark.util.{TaskCompletionListener, TaskFailureListener} object TaskContext { @@ -106,6 +106,8 @@ abstract class TaskContext extends Serializable { * Adds a (Java friendly) listener to be executed on task completion. * This will be called in all situation - success, failure, or cancellation. * An example use is for HadoopRDD to register a callback to close the input stream. + * + * Exceptions thrown by the listener will result in failure of the task. */ def addTaskCompletionListener(listener: TaskCompletionListener): TaskContext @@ -113,8 +115,30 @@ abstract class TaskContext extends Serializable { * Adds a listener in the form of a Scala closure to be executed on task completion. * This will be called in all situations - success, failure, or cancellation. * An example use is for HadoopRDD to register a callback to close the input stream. + * + * Exceptions thrown by the listener will result in failure of the task. 
*/ - def addTaskCompletionListener(f: (TaskContext) => Unit): TaskContext + def addTaskCompletionListener(f: (TaskContext) => Unit): TaskContext = { +addTaskCompletionListener(new TaskCompletionListener { + override def onTaskCompletion(context: TaskContext): Unit = f(context) +}) + } + + /** + * Adds a listener to be executed on task failure. + * Operations defined here must be idempotent, as `onTaskFailure` can be called multiple times. + */ + def addTaskFailureListener(listener: TaskFailureListener): TaskContext + + /** + * Adds a listener to be executed on task failure. + * Operations defined here must be idempotent, as `onTaskFailure` can be called multiple times. + */ + def addTaskFailureListener(f: (TaskContext, Throwable) => Unit): TaskContext = { +addTaskFailureListener(new TaskFailureListener { + override def onTaskFailure(context: TaskContext, error: Throwable): Unit = f(context, error) +}) + } /** * The ID of the stage that this task belong to. http://git-wip-us.apache.org/repos/asf/spark/blob/391755dc/core/src/main/scala/org/apache/spark/TaskContextImpl.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskContextImpl.scala b/core/src/main/scala/org/apache/spark/TaskContextImpl.scala inde
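The failure listener is registered the same way as the existing completion callback. A minimal usage sketch of the closure-based form added by this patch (illustrative, assuming an existing `SparkContext` named `sc`; the listeners run inside the task):

```
import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, 4)

rdd.mapPartitions { iter =>
  val ctx = TaskContext.get()

  // Pre-existing API: called on success, failure, or cancellation.
  ctx.addTaskCompletionListener { _ =>
    println(s"task ${ctx.taskAttemptId()} finished")
  }

  // New in this patch: called only when the task fails, with the causing error.
  // Must be idempotent, since it can be invoked multiple times.
  ctx.addTaskFailureListener { (_, error) =>
    println(s"task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
  }

  iter.map(_ * 2)
}.count()
```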
spark git commit: [SPARK-13499] [SQL] Performance improvements for parquet reader.
Repository: spark Updated Branches: refs/heads/master 6df1e55a6 -> 0598a2b81 [SPARK-13499] [SQL] Performance improvements for parquet reader. ## What changes were proposed in this pull request? This patch includes these performance fixes: - Remove unnecessary setNotNull() calls. The NULL bits are cleared already. - Speed up RLE group decoding - Speed up dictionary decoding by decoding NULLs directly into the result. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) In addition to the updated benchmarks, on TPCDS, the result of these changes running Q55 (sf40) is: ``` TPCDS: Best/Avg Time(ms)Rate(M/s) Per Row(ns) - q55 (Before) 6398 / 6616 18.0 55.5 q55 (After) 4983 / 5189 23.1 43.3 ``` Author: Nong Li Closes #11375 from nongli/spark-13499. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0598a2b8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0598a2b8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0598a2b8 Branch: refs/heads/master Commit: 0598a2b81d1426dd2cf9e6fc32cef345364d18c6 Parents: 6df1e55 Author: Nong Li Authored: Fri Feb 26 12:43:50 2016 -0800 Committer: Davies Liu Committed: Fri Feb 26 12:43:50 2016 -0800 -- .../parquet/UnsafeRowParquetRecordReader.java | 17 + .../parquet/VectorizedRleValuesReader.java | 66 .../parquet/ParquetReadBenchmark.scala | 30 - 3 files changed, 59 insertions(+), 54 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0598a2b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java index 4576ac2..9d50cfa 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java @@ -628,7 +628,8 @@ public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBas dictionaryIds.reserve(total); } // Read and decode dictionary ids. - readIntBatch(rowId, num, dictionaryIds); + defColumn.readIntegers( + num, dictionaryIds, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn); decodeDictionaryIds(rowId, num, column); } else { switch (descriptor.getType()) { @@ -739,18 +740,6 @@ public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBas default: throw new NotImplementedException("Unsupported type: " + descriptor.getType()); } - - if (dictionaryIds.numNulls() > 0) { -// Copy the NULLs over. -// TODO: we can improve this by decoding the NULLs directly into column. This would -// mean we decode the int ids into `dictionaryIds` and the NULLs into `column` and then -// just do the ID remapping as above. 
-for (int i = 0; i < num; ++i) { - if (dictionaryIds.getIsNull(rowId + i)) { -column.putNull(rowId + i); - } -} - } } /** @@ -769,7 +758,7 @@ public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBas // TODO: implement remaining type conversions if (column.dataType() == DataTypes.IntegerType || column.dataType() == DataTypes.DateType) { defColumn.readIntegers( -num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn, 0); +num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn); } else if (column.dataType() == DataTypes.ByteType) { defColumn.readBytes( num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn); http://git-wip-us.apache.org/repos/asf/spark/blob/0598a2b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java index 629959a..
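The effect of these decoding changes is easiest to see by timing a plain scan with the vectorized reader on and off, in the spirit of the updated `ParquetReadBenchmark`. A rough hand-rolled sketch (not the actual benchmark harness), assuming an existing `SQLContext` named `sqlContext` and a hypothetical output path:

```
val path = "/tmp/parquet-scan-bench"  // hypothetical path

// Write some throwaway data once, then time repeated single-column scans.
sqlContext.range(0L, 10000000L)
  .selectExpr("id", "cast(id as string) as s")
  .write.mode("overwrite").parquet(path)

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "true")
time("vectorized")(sqlContext.read.parquet(path).selectExpr("sum(id)").collect())

sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")
time("row-at-a-time")(sqlContext.read.parquet(path).selectExpr("sum(id)").collect())
```

Timings from a single run like this are noisy; the committed benchmark averages over multiple iterations.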
spark git commit: [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore.
Repository: spark Updated Branches: refs/heads/branch-1.6 abe8f991a -> a57f87ee4 [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore. ## What changes were proposed in this pull request? This change adds a workaround to allow users to drop a table with a name starting with an underscore. Without this patch, we can create such a table, but we cannot drop it. The reason is that Hive's parser unquote an quoted identifier (see https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g#L453). So, when we issue a drop table command to Hive, a table name starting with an underscore is actually not quoted. Then, Hive will complain about it because it does not support a table name starting with an underscore without using backticks (underscores are allowed as long as it is not the first char though). ## How was this patch tested? Add a test to make sure we can drop a table with a name starting with an underscore. https://issues.apache.org/jira/browse/SPARK-13454 Author: Yin Huai Closes #11349 from yhuai/fixDropTable. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a57f87ee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a57f87ee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a57f87ee Branch: refs/heads/branch-1.6 Commit: a57f87ee4aafdb97c15f4076e20034ea34c7e2e5 Parents: abe8f99 Author: Yin Huai Authored: Fri Feb 26 12:34:03 2016 -0800 Committer: Yin Huai Committed: Fri Feb 26 12:34:03 2016 -0800 -- .../spark/sql/hive/execution/commands.scala | 20 +-- .../sql/hive/HiveMetastoreCatalogSuite.scala| 21 2 files changed, 39 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a57f87ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala index 94210a5..6b16d59 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.hive.execution import org.apache.hadoop.hive.metastore.MetaStoreUtils import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.{SqlParser, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.EliminateSubQueries import org.apache.spark.sql.catalyst.expressions.Attribute import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan @@ -70,7 +70,23 @@ case class DropTable( case e: Throwable => log.warn(s"${e.getMessage}", e) } hiveContext.invalidateTable(tableName) -hiveContext.runSqlHive(s"DROP TABLE $ifExistsClause$tableName") +val tableNameForHive = { + // Hive's parser will unquote an identifier (see the rule of QuotedIdentifier in + // HiveLexer.g of Hive 1.2.1). For the DROP TABLE command that we pass in Hive, we + // will use the quoted form (db.tableName) if the table name starts with a _. + // Otherwise, we keep the unquoted form (`db`.`tableName`), which is the same as tableName + // passed into this DropTable class. Please note that although QuotedIdentifier rule + // allows backticks appearing in an identifier, Hive does not actually allow such + // an identifier be a table name. So, we do not check if a table name part has + // any backtick or not. 
+ // + // This change is at here because this patch is just for 1.6 branch and we try to + // avoid of affecting normal cases (tables do not use _ as the first character of + // their name). + val identifier = SqlParser.parseTableIdentifier(tableName) + if (identifier.table.startsWith("_")) identifier.quotedString else identifier.unquotedString +} +hiveContext.runSqlHive(s"DROP TABLE $ifExistsClause$tableNameForHive") hiveContext.catalog.unregisterTable(TableIdentifier(tableName)) Seq.empty[Row] } http://git-wip-us.apache.org/repos/asf/spark/blob/a57f87ee/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala index d63f3d3..15fd96a 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala +++ b/sql/hive/src/test/scala/or
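The scenario being fixed is easy to reproduce from SQL. A minimal sketch (illustrative table name), assuming an existing `HiveContext` named `hiveContext`:

```
// Creating a table whose name starts with an underscore has always worked,
// as long as the name is quoted with backticks.
hiveContext.sql("CREATE TABLE `_tmp_table` (key INT, value STRING)")

// Before this patch the drop failed: the name reached Hive unquoted, and Hive
// rejects identifiers that start with an underscore. With the fix, the name is
// re-quoted with backticks before being passed to Hive and the drop succeeds.
hiveContext.sql("DROP TABLE IF EXISTS `_tmp_table`")
```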
spark git commit: [SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin
Repository: spark Updated Branches: refs/heads/master 727e78014 -> 6df1e55a6 [SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin ## What changes were proposed in this pull request? Currently, BroadcastNestedLoopJoin is implemented for the worst case; it is too slow and can easily hang forever. This PR creates fast paths for some combinations of joinType and buildSide, and also improves the worst case (which now uses much less memory than before). Before this PR, one task required O(N*K) + O(K) memory in the worst case, where N is the number of rows from one partition of the streamed table, and this could hang the job (because of GC). To work around this for InnerJoin, we had to disable auto-broadcast and switch to CartesianProduct; see https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html In this PR, we have fast paths for these joins: InnerJoin with BuildLeft or BuildRight, LeftOuterJoin with BuildRight, RightOuterJoin with BuildLeft, LeftSemi with BuildRight. These fast paths are all stream based (one pass over the streamed table) and require O(1) memory. All other join types and build sides take two passes over the streamed table: one pass to produce the matched rows (including the streamed side), which requires O(1) memory, and another pass to find the rows from the build table that have no matched row in the streamed table, which requires O(K) memory, where K is the number of rows on the build side (one bit per row, which should be much smaller than the memory used for the broadcast). The following join types work in this way: LeftOuterJoin with BuildLeft, RightOuterJoin with BuildRight, FullOuterJoin with BuildLeft or BuildRight, LeftSemi with BuildLeft. This PR also adds tests for all the join types of BroadcastNestedLoopJoin. After this PR, for InnerJoin with one small table, BroadcastNestedLoopJoin should be faster than CartesianProduct, so the workaround above is no longer needed. ## How was this patch tested? Added unit tests. Author: Davies Liu Closes #11328 from davies/nested_loop. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6df1e55a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6df1e55a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6df1e55a Branch: refs/heads/master Commit: 6df1e55a6594ae4bc7882f44af8d230aad9489b4 Parents: 727e780 Author: Davies Liu Authored: Fri Feb 26 09:58:05 2016 -0800 Committer: Davies Liu Committed: Fri Feb 26 09:58:05 2016 -0800 -- .../catalyst/plans/physical/broadcastMode.scala | 1 + .../spark/sql/execution/SparkStrategies.scala | 14 +- .../joins/BroadcastNestedLoopJoin.scala | 295 ++- .../scala/org/apache/spark/sql/JoinSuite.scala | 11 +- .../sql/execution/joins/InnerJoinSuite.scala| 27 ++ .../sql/execution/joins/OuterJoinSuite.scala| 18 ++ .../sql/execution/joins/SemiJoinSuite.scala | 20 +- 7 files changed, 295 insertions(+), 91 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6df1e55a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala index c646dcf..e01f69f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala @@ -31,5 +31,6 @@ trait BroadcastMode { * IdentityBroadcastMode requires that rows are broadcasted in their original form. */ case object IdentityBroadcastMode extends BroadcastMode { + // TODO: pack the UnsafeRows into single bytes array. override def transform(rows: Array[InternalRow]): Array[InternalRow] = rows } http://git-wip-us.apache.org/repos/asf/spark/blob/6df1e55a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala index 5fdf38c..dd8c96d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala @@ -253,22 +253,19 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] { object BroadcastNestedLoop extends Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { - case logical.Join( - CanBroad
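BroadcastNestedLoopJoin is the plan Spark falls back to when a join has no equi-join keys but one side can be broadcast. A minimal sketch of a query shape that typically exercises one of the new fast paths (LeftOuter with the broadcast side on the right); the data and predicate are illustrative, and an existing `SQLContext` named `sqlContext` is assumed:

```
import org.apache.spark.sql.functions.broadcast

val events = sqlContext.range(0L, 1000000L).selectExpr("id", "id % 100 as score")
val buckets = sqlContext.range(0L, 10L).selectExpr("id * 10 as lo", "id * 10 + 9 as hi")

// A range (non-equi) join condition: no hash join applies, so with one side
// broadcast this typically plans as BroadcastNestedLoopJoin.
val joined = events.join(
  broadcast(buckets),
  events("score") >= buckets("lo") && events("score") <= buckets("hi"),
  "left_outer")

joined.explain()
println(joined.count())
```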
spark git commit: [MINOR][SQL] Fix modifier order.
Repository: spark Updated Branches: refs/heads/master 7af0de076 -> 727e78014 [MINOR][SQL] Fix modifier order. ## What changes were proposed in this pull request? This PR fixes the order of modifier from `abstract public` into `public abstract`. Currently, when we run `./dev/lint-java`, it shows the error. ``` Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/sketch/CountMinSketch.java:[53,10] (modifier) ModifierOrder: 'public' modifier out of order with the JLS suggestions. ``` ## How was this patch tested? ``` $ ./dev/lint-java Checkstyle checks passed. ``` Author: Dongjoon Hyun Closes #11390 from dongjoon-hyun/fix_modifier_order. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/727e7801 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/727e7801 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/727e7801 Branch: refs/heads/master Commit: 727e78014fd4957e477d62adc977fa4da3e3455d Parents: 7af0de0 Author: Dongjoon Hyun Authored: Fri Feb 26 17:11:19 2016 + Committer: Sean Owen Committed: Fri Feb 26 17:11:19 2016 + -- .../src/main/java/org/apache/spark/util/sketch/CountMinSketch.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/727e7801/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java -- diff --git a/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java b/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java index 2c9aa93..40fa20c 100644 --- a/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java +++ b/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java @@ -50,7 +50,7 @@ import java.io.OutputStream; * * This implementation is largely based on the {@code CountMinSketch} class from stream-lib. */ -abstract public class CountMinSketch { +public abstract class CountMinSketch { public enum Version { /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example
Repository: spark Updated Branches: refs/heads/master b33261f91 -> 7af0de076 [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example ## What changes were proposed in this pull request? This PR replaces example codes in `mllib-linear-methods.md` using `include_example` by doing the followings: * Extracts the example codes(Scala,Java,Python) as files in `example` module. * Merges some dialog-style examples into a single file. * Hide redundant codes in HTML for the consistency with other docs. ## How was the this patch tested? manual test. This PR can be tested by document generations, `SKIP_API=1 jekyll build`. Author: Dongjoon Hyun Closes #11320 from dongjoon-hyun/SPARK-11381. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7af0de07 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7af0de07 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7af0de07 Branch: refs/heads/master Commit: 7af0de076f74e975c9235c88b0f11b22fcbae060 Parents: b33261f Author: Dongjoon Hyun Authored: Fri Feb 26 08:31:55 2016 -0800 Committer: Xiangrui Meng Committed: Fri Feb 26 08:31:55 2016 -0800 -- docs/mllib-linear-methods.md| 441 ++- .../JavaLinearRegressionWithSGDExample.java | 94 .../JavaLogisticRegressionWithLBFGSExample.java | 79 .../examples/mllib/JavaSVMWithSGDExample.java | 82 .../mllib/linear_regression_with_sgd_example.py | 54 +++ .../logistic_regression_with_lbfgs_example.py | 54 +++ .../streaming_linear_regression_example.py | 62 +++ .../main/python/mllib/svm_with_sgd_example.py | 47 ++ .../mllib/LinearRegressionWithSGDExample.scala | 64 +++ .../LogisticRegressionWithLBFGSExample.scala| 69 +++ .../examples/mllib/SVMWithSGDExample.scala | 70 +++ .../StreamingLinearRegressionExample.scala | 58 +++ 12 files changed, 758 insertions(+), 416 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7af0de07/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index aac8f75..63665c4 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -170,42 +170,7 @@ error. Refer to the [`SVMWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) and [`SVMModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMModel) for details on the API. -{% highlight scala %} -import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} -import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics -import org.apache.spark.mllib.util.MLUtils - -// Load training data in LIBSVM format. -val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") - -// Split data into training (60%) and test (40%). -val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L) -val training = splits(0).cache() -val test = splits(1) - -// Run training algorithm to build the model -val numIterations = 100 -val model = SVMWithSGD.train(training, numIterations) - -// Clear the default threshold. -model.clearThreshold() - -// Compute raw scores on the test set. -val scoreAndLabels = test.map { point => - val score = model.predict(point.features) - (score, point.label) -} - -// Get evaluation metrics. 
-val metrics = new BinaryClassificationMetrics(scoreAndLabels) -val auROC = metrics.areaUnderROC() - -println("Area under ROC = " + auROC) - -// Save and load model -model.save(sc, "myModelPath") -val sameModel = SVMModel.load(sc, "myModelPath") -{% endhighlight %} +{% include_example scala/org/apache/spark/examples/mllib/SVMWithSGDExample.scala %} The `SVMWithSGD.train()` method by default performs L2 regularization with the regularization parameter set to 1.0. If we want to configure this algorithm, we @@ -216,6 +181,7 @@ variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations. {% highlight scala %} + import org.apache.spark.mllib.optimization.L1Updater val svmAlg = new SVMWithSGD() @@ -237,61 +203,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the [`SVMWithSGD` Java docs](api/java/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Java docs](api/java/org/apache/spark/mllib/classification/SVMModel.html) for details on the API. -{% highlight java %} -import scala.Tuple2; - -import org.apache.spark.api.java.*; -import org.apache.spark.api.java.function.Function; -import org.apache.spark.mllib.classification.*; -import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics; - -import org.apache.spark.mllib.regression.LabeledP
spark git commit: [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format
Repository: spark Updated Branches: refs/heads/master 99dfcedbf -> b33261f91 [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the tree module. closes #10601 Author: Bryan Cutler Author: vijaykiran Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b33261f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b33261f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b33261f9 Branch: refs/heads/master Commit: b33261f91387904c5aaccae40f86922c92a4e09a Parents: 99dfced Author: Bryan Cutler Authored: Fri Feb 26 08:30:32 2016 -0800 Committer: Xiangrui Meng Committed: Fri Feb 26 08:30:32 2016 -0800 -- docs/mllib-decision-tree.md | 6 +- .../apache/spark/mllib/tree/DecisionTree.scala | 172 +- .../spark/mllib/tree/GradientBoostedTrees.scala | 18 +- .../apache/spark/mllib/tree/RandomForest.scala | 69 ++-- .../mllib/tree/configuration/Strategy.scala | 13 +- python/pyspark/mllib/tree.py| 339 +++ 6 files changed, 340 insertions(+), 277 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b33261f9/docs/mllib-decision-tree.md -- diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md index a8612b6..9af4835 100644 --- a/docs/mllib-decision-tree.md +++ b/docs/mllib-decision-tree.md @@ -121,12 +121,12 @@ The parameters are listed below roughly in order of descending importance. New These parameters describe the problem you want to solve and your dataset. They should be specified and do not require tuning. -* **`algo`**: `Classification` or `Regression` +* **`algo`**: Type of decision tree, either `Classification` or `Regression`. -* **`numClasses`**: Number of classes (for `Classification` only) +* **`numClasses`**: Number of classes (for `Classification` only). * **`categoricalFeaturesInfo`**: Specifies which features are categorical and how many categorical values each of those features can take. This is given as a map from feature indices to feature arity (number of categories). Any features not in this map are treated as continuous. - * E.g., `Map(0 -> 2, 4 -> 10)` specifies that feature `0` is binary (taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 1, ..., 9}`). Note that feature indices are 0-based: features `0` and `4` are the 1st and 5th elements of an instance's feature vector. + * For example, `Map(0 -> 2, 4 -> 10)` specifies that feature `0` is binary (taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 1, ..., 9}`). Note that feature indices are 0-based: features `0` and `4` are the 1st and 5th elements of an instance's feature vector. * Note that you do not have to specify `categoricalFeaturesInfo`. The algorithm will still run and may get reasonable results. However, performance should be better if categorical features are properly designated. 
### Stopping criteria http://git-wip-us.apache.org/repos/asf/spark/blob/b33261f9/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index 51235a2..40440d5 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -38,8 +38,9 @@ import org.apache.spark.util.random.XORShiftRandom /** * A class which implements a decision tree learning algorithm for classification and regression. * It supports both continuous and categorical features. + * * @param strategy The configuration parameters for the tree algorithm which specify the type - * of algorithm (classification, regression, etc.), feature type (continuous, + * of decision tree (classification or regression), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. */ @Since("1.0.0") @@ -50,8 +51,8 @@ class DecisionTree @Since("1.0.0") (private val strategy: Strategy) /** * Method to train a decision tree model over an RDD - * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] - * @return DecisionTreeModel that can be used for prediction + * @param input Training data: RDD of [[org.ap
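Editorial note: the `categoricalFeaturesInfo` map described in the doc hunk above is easiest to see in code. The following small sketch is not taken from the patch; it simply shows how `Map(0 -> 2, 4 -> 10)` is passed to `DecisionTree.trainClassifier`, with the dataset path and remaining hyperparameters as placeholder values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

object CategoricalFeaturesInfoSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CategoricalFeaturesInfoSketch"))

    // Any RDD[LabeledPoint] works here; the bundled sample dataset is used as a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Feature 0 is binary (2 categories), feature 4 has 10 categories {0, 1, ..., 9};
    // every feature not listed in the map is treated as continuous.
    val categoricalFeaturesInfo = Map(0 -> 2, 4 -> 10)

    // Arguments: input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins.
    val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo, "gini", 5, 32)

    println(model.toDebugString)
    sc.stop()
  }
}
```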
spark git commit: [SPARK-13457][SQL] Removes DataFrame RDD operations
Repository: spark Updated Branches: refs/heads/master 5c3912e5c -> 99dfcedbf [SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This is another attempt at PR #11323. This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`. PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323. ## How was this patch tested? No extra tests are added. Existing tests should cover this change. Author: Cheng Lian Closes #11388 from liancheng/remove-df-rdd-ops. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99dfcedb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99dfcedb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99dfcedb Branch: refs/heads/master Commit: 99dfcedbfd4c83c7b6a343456f03e8c6e29968c5 Parents: 5c3912e Author: Cheng Lian Authored: Sat Feb 27 00:28:30 2016 +0800 Committer: Cheng Lian Committed: Sat Feb 27 00:28:30 2016 +0800 -- .../spark/examples/ml/DataFrameExample.scala| 2 +- .../spark/examples/ml/DecisionTreeExample.scala | 8 +++ .../spark/examples/ml/OneVsRestExample.scala| 2 +- .../spark/examples/mllib/LDAExample.scala | 1 + .../apache/spark/examples/sql/RDDRelation.scala | 2 +- .../spark/examples/sql/hive/HiveFromSpark.scala | 2 +- .../scala/org/apache/spark/ml/Predictor.scala | 6 +++-- .../ml/classification/LogisticRegression.scala | 13 +++ .../spark/ml/clustering/BisectingKMeans.scala | 4 ++-- .../org/apache/spark/ml/clustering/KMeans.scala | 6 ++--- .../org/apache/spark/ml/clustering/LDA.scala| 1 + .../BinaryClassificationEvaluator.scala | 9  .../MulticlassClassificationEvaluator.scala | 6 ++--- .../ml/evaluation/RegressionEvaluator.scala | 3 ++- .../apache/spark/ml/feature/ChiSqSelector.scala | 2 +- .../spark/ml/feature/CountVectorizer.scala | 2 +- .../scala/org/apache/spark/ml/feature/IDF.scala | 2 +- .../apache/spark/ml/feature/MaxAbsScaler.scala | 2 +- .../apache/spark/ml/feature/MinMaxScaler.scala | 2 +- .../apache/spark/ml/feature/OneHotEncoder.scala | 2 +- .../scala/org/apache/spark/ml/feature/PCA.scala | 2 +- .../spark/ml/feature/StandardScaler.scala | 2 +- .../apache/spark/ml/feature/StringIndexer.scala | 1 + .../apache/spark/ml/feature/VectorIndexer.scala | 2 +- .../org/apache/spark/ml/feature/Word2Vec.scala | 2 +- .../apache/spark/ml/recommendation/ALS.scala| 1 + .../ml/regression/AFTSurvivalRegression.scala | 2 +- .../ml/regression/IsotonicRegression.scala | 6 ++--- .../spark/ml/regression/LinearRegression.scala | 16 - .../spark/mllib/api/python/PythonMLLibAPI.scala | 8 +++ .../spark/mllib/clustering/KMeansModel.scala| 2 +- .../spark/mllib/clustering/LDAModel.scala | 4 ++-- .../clustering/PowerIterationClustering.scala | 2 +- .../BinaryClassificationMetrics.scala | 2 +- .../mllib/evaluation/MulticlassMetrics.scala| 2 +- .../mllib/evaluation/MultilabelMetrics.scala| 4 +++- .../mllib/evaluation/RegressionMetrics.scala| 2 +- .../spark/mllib/feature/ChiSqSelector.scala | 2 +- .../org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- .../MatrixFactorizationModel.scala | 12 +- .../mllib/tree/model/DecisionTreeModel.scala| 2 +- .../mllib/tree/model/treeEnsembleModels.scala | 2 +- .../LogisticRegressionSuite.scala | 2 +- 
.../MultilayerPerceptronClassifierSuite.scala | 5 ++-- .../ml/classification/OneVsRestSuite.scala | 6 ++--- .../ml/clustering/BisectingKMeansSuite.scala| 3 ++- .../spark/ml/clustering/KMeansSuite.scala | 3 ++- .../apache/spark/ml/clustering/LDASuite.scala | 2 +- .../spark/ml/feature/OneHotEncoderSuite.scala | 4 ++-- .../spark/ml/feature/StringIndexerSuite.scala | 6 ++--- .../spark/ml/feature/VectorIndexerSuite.scala | 5 ++-- .../apache/spark/ml/feature/Word2VecSuite.scala | 8 +++ .../spark/ml/recommendation/ALSSuite.scala | 7 +++--- .../spark/ml/regression/GBTRegressorSuite.scala | 2 +- .../ml/regression/IsotonicRegressionSuite.scala | 6 ++--- .../ml/regression/LinearRegressionSuite.scala | 17 +++--- .../scala/org/apache/spark/sql/DataFrame.scala | 24 .../org/apache/spark/sql/GroupedData.scala | 1 + .../org/apache/spark/sql/api/r/SQLUtils.scala | 2 +- .../spark/sql/DataFrameAggregateSuite.scala | 2 +- .../parquet/ParquetFilterSuite.scala| 2 +- .../datasources/parquet/Par
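Editorial note: for code that previously used the removed RDD-style operations on `DataFrame`, the migration is mechanical — insert `.rdd` before the transformation. The following sketch of the before/after pattern uses a hypothetical helper and is not taken from the patch.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object DataFrameRddMigrationSketch {
  // Hypothetical helper: collect the first column of the first five rows.
  def firstValues(df: DataFrame): Array[Double] = {
    // Before this change: df.map(_.getDouble(0)).take(5)
    // After this change: go through the underlying RDD explicitly.
    df.rdd.map((row: Row) => row.getDouble(0)).take(5)
  }

  // foreach and foreachPartition stay on DataFrame itself: they are actions, and they
  // wrap the underlying RDD job in withNewExecutionId for job tracking.
  def printAll(df: DataFrame): Unit = df.foreach(row => println(row))
}
```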
spark git commit: [SPARK-12523][YARN] Support long-running Spark on HBase and the Hive metastore
Repository: spark Updated Branches: refs/heads/master 318bf4115 -> 5c3912e5c [SPARK-12523][YARN] Support long-running Spark on HBase and the Hive metastore. Obtain the Hive metastore and HBase tokens, as well as the HDFS token, in AMDelegationTokenRenewer to support long-running applications of Spark on HBase or the Thrift server. Author: huangzhaowei Closes #10645 from SaintBacchus/SPARK-12523. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c3912e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c3912e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c3912e5 Branch: refs/heads/master Commit: 5c3912e5c90ce659146c3056430d100604378b71 Parents: 318bf41 Author: huangzhaowei Authored: Fri Feb 26 07:32:07 2016 -0600 Committer: Tom Graves Committed: Fri Feb 26 07:32:07 2016 -0600 -- .../deploy/yarn/AMDelegationTokenRenewer.scala | 2 + .../org/apache/spark/deploy/yarn/Client.scala | 42 +--- .../spark/deploy/yarn/YarnSparkHadoopUtil.scala | 38 ++ 3 files changed, 42 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5c3912e5/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala index 2ac9e33..70b67d2 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala @@ -172,6 +172,8 @@ private[yarn] class AMDelegationTokenRenewer( override def run(): Void = { val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + dst hadoopUtil.obtainTokensForNamenodes(nns, freshHadoopConf, tempCreds) +hadoopUtil.obtainTokenForHiveMetastore(sparkConf, freshHadoopConf, tempCreds) +hadoopUtil.obtainTokenForHBase(sparkConf, freshHadoopConf, tempCreds) null } }) http://git-wip-us.apache.org/repos/asf/spark/blob/5c3912e5/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index 530f1d7..dac3ea2 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -345,8 +345,8 @@ private[spark] class Client( // multiple times, YARN will fail to launch containers for the app with an internal // error. val distributedUris = new HashSet[String] -obtainTokenForHiveMetastore(sparkConf, hadoopConf, credentials) -obtainTokenForHBase(sparkConf, hadoopConf, credentials) +YarnSparkHadoopUtil.get.obtainTokenForHiveMetastore(sparkConf, hadoopConf, credentials) +YarnSparkHadoopUtil.get.obtainTokenForHBase(sparkConf, hadoopConf, credentials) val replication = sparkConf.getInt("spark.yarn.submit.file.replication", fs.getDefaultReplication(dst)).toShort @@ -1358,35 +1358,6 @@ object Client extends Logging { } /** - * Obtains token for the Hive metastore and adds them to the credentials. 
- */ - private def obtainTokenForHiveMetastore( - sparkConf: SparkConf, - conf: Configuration, - credentials: Credentials) { -if (shouldGetTokens(sparkConf, "hive") && UserGroupInformation.isSecurityEnabled) { - YarnSparkHadoopUtil.get.obtainTokenForHiveMetastore(conf).foreach { -credentials.addToken(new Text("hive.server2.delegation.token"), _) - } -} - } - - /** - * Obtain a security token for HBase. - */ - def obtainTokenForHBase( - sparkConf: SparkConf, - conf: Configuration, - credentials: Credentials): Unit = { -if (shouldGetTokens(sparkConf, "hbase") && UserGroupInformation.isSecurityEnabled) { - YarnSparkHadoopUtil.get.obtainTokenForHBase(conf).foreach { token => -credentials.addToken(token.getService, token) -logInfo("Added HBase security token to credentials.") - } -} - } - - /** * Return whether the two file systems are the same. */ private def compareFs(srcFs: FileSystem, destFs: FileSystem): Boolean = { @@ -1450,13 +1421,4 @@ object Client extends Logging { components.mkString(Path.SEPARATOR) } - /** - * Return whether delegation tokens should be retrieved for the given service when security is - * enabled. By default, tokens are retrieved, but that behavior can be changed by setting - * a service-specific configuratio
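Editorial note: taken together, the hunks above move the token-fetching helpers onto `YarnSparkHadoopUtil` so the AM-side renewer can refresh all three credential types on every cycle. Below is a simplified, illustrative sketch of that renewal pattern; it is Spark-internal code (these helpers are not a public API), and the scheduling and token-propagation logic in `AMDelegationTokenRenewer` is omitted.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf
import org.apache.spark.deploy.yarn.YarnSparkHadoopUtil

object DelegationTokenRefreshSketch {
  // One renewal pass: gather fresh HDFS, Hive metastore, and HBase tokens into a
  // single Credentials object that is later written out for executors to pick up.
  def obtainAllTokens(sparkConf: SparkConf, hadoopConf: Configuration): Credentials = {
    val creds = new Credentials()
    val hadoopUtil = YarnSparkHadoopUtil.get
    val nns = hadoopUtil.getNameNodesToAccess(sparkConf)
    hadoopUtil.obtainTokensForNamenodes(nns, hadoopConf, creds)
    hadoopUtil.obtainTokenForHiveMetastore(sparkConf, hadoopConf, creds)
    hadoopUtil.obtainTokenForHBase(sparkConf, hadoopConf, creds)
    creds
  }
}
```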
spark git commit: [MINOR][STREAMING] Fix a minor naming issue in JavaDStreamLike
Repository: spark Updated Branches: refs/heads/master 9812a24aa -> 318bf4115 [MINOR][STREAMING] Fix a minor naming issue in JavaDStreamLike Author: Liwei Lin Closes #11385 from proflin/Fix-minor-naming. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/318bf411 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/318bf411 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/318bf411 Branch: refs/heads/master Commit: 318bf41158a670e9d62123ea0cb27a833affae24 Parents: 9812a24 Author: Liwei Lin Authored: Fri Feb 26 00:21:53 2016 -0800 Committer: Reynold Xin Committed: Fri Feb 26 00:21:53 2016 -0800 -- .../org/apache/spark/streaming/api/java/JavaDStreamLike.scala | 2 ++ 1 file changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/318bf411/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala b/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala index 65aab2f..43632f3 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala @@ -50,6 +50,8 @@ trait JavaDStreamLike[T, This <: JavaDStreamLike[T, This, R], R <: JavaRDDLike[T def wrapRDD(in: RDD[T]): R + // This is just unfortunate we made a mistake in naming -- should be scalaLongToJavaLong. + // Don't fix this for now as it would break binary compatibility. implicit def scalaIntToJavaLong(in: DStream[Long]): JavaDStream[jl.Long] = { in.map(jl.Long.valueOf) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org