spark git commit: [SPARK-13518][SQL] Enable vectorized parquet scanner by default
Repository: spark Updated Branches: refs/heads/master 59e3e10be -> 7a0cb4e58 [SPARK-13518][SQL] Enable vectorized parquet scanner by default ## What changes were proposed in this pull request? Change the default of the flag to enable this feature now that the implementation is complete. ## How was this patch tested? The new parquet reader should be a drop in, so will be exercised by the existing tests. Author: Nong Li Closes #11397 from nongli/spark-13518. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7a0cb4e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7a0cb4e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7a0cb4e5 Branch: refs/heads/master Commit: 7a0cb4e58728834b49050ce4fae418acc18a601f Parents: 59e3e10 Author: Nong Li Authored: Fri Feb 26 22:36:32 2016 -0800 Committer: Reynold Xin Committed: Fri Feb 26 22:36:32 2016 -0800 -- .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7a0cb4e5/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala index 9a50ef7..1d1e288 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala @@ -345,12 +345,9 @@ object SQLConf { defaultValue = Some(true), doc = "Enables using the custom ParquetUnsafeRowRecordReader.") - // Note: this can not be enabled all the time because the reader will not be returning UnsafeRows. - // Doing so is very expensive and we should remove this requirement instead of fixing it here. - // Initial testing seems to indicate only sort requires this. val PARQUET_VECTORIZED_READER_ENABLED = booleanConf( key = "spark.sql.parquet.enableVectorizedReader", -defaultValue = Some(false), +defaultValue = Some(true), doc = "Enables vectorized parquet decoding.") val ORC_FILTER_PUSHDOWN_ENABLED = booleanConf("spark.sql.orc.filterPushdown", - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
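For readers who want to verify the new default or temporarily fall back to the previous behavior, the flag can be inspected and toggled per session. A minimal Scala sketch (not part of the patch), assuming an existing `SQLContext` named `sqlContext` and a hypothetical Parquet path:

```
// The flag defaults to "true" after this commit.
println(sqlContext.getConf("spark.sql.parquet.enableVectorizedReader"))

// Fall back to the non-vectorized reader for this session if needed.
sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")

// Reads after this point use the row-at-a-time Parquet record reader.
val df = sqlContext.read.parquet("/tmp/example.parquet")  // hypothetical path
println(df.count())
```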
spark git commit: [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts
Repository: spark Updated Branches: refs/heads/master f77dc4e1e -> 59e3e10be [SPARK-13521][BUILD] Remove reference to Tachyon in cluster & release scripts ## What changes were proposed in this pull request? We provide a very limited set of cluster management script in Spark for Tachyon, although Tachyon itself provides a much better version of it. Given now Spark users can simply use Tachyon as a normal file system and does not require extensive configurations, we can remove this management capabilities to simplify Spark bash scripts. Note that this also reduces coupling between a 3rd party external system and Spark's release scripts, and would eliminate possibility for failures such as Tachyon being renamed or the tar balls being relocated. ## How was this patch tested? N/A Author: Reynold Xin Closes #11400 from rxin/release-script. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/59e3e10b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/59e3e10b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/59e3e10b Branch: refs/heads/master Commit: 59e3e10be2f9a1c53979ca72c038adb4fa17ca64 Parents: f77dc4e Author: Reynold Xin Authored: Fri Feb 26 22:35:12 2016 -0800 Committer: Reynold Xin Committed: Fri Feb 26 22:35:12 2016 -0800 -- docs/configuration.md | 24 docs/job-scheduling.md| 3 +-- docs/programming-guide.md | 22 ++- make-distribution.sh | 50 +- sbin/start-all.sh | 15 ++--- sbin/start-master.sh | 21 -- sbin/start-slaves.sh | 22 --- sbin/stop-master.sh | 4 sbin/stop-slaves.sh | 5 - 9 files changed, 6 insertions(+), 160 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/59e3e10b/docs/configuration.md -- diff --git a/docs/configuration.md b/docs/configuration.md index 4b1b007..e9b6623 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -929,30 +929,6 @@ Apart from these, the following properties are also available, and may be useful mapping has high overhead for blocks close to or below the page size of the operating system. - - spark.externalBlockStore.blockManager - org.apache.spark.storage.TachyonBlockManager - -Implementation of external block manager (file system) that store RDDs. The file system's URL is set by -spark.externalBlockStore.url. - - - - spark.externalBlockStore.baseDir - System.getProperty("java.io.tmpdir") - -Directories of the external block store that store RDDs. The file system's URL is set by - spark.externalBlockStore.url It can also be a comma-separated list of multiple -directories on Tachyon file system. - - - - spark.externalBlockStore.url - tachyon://localhost:19998 for Tachyon - -The URL of the underlying external blocker file system in the external block store. - - Networking http://git-wip-us.apache.org/repos/asf/spark/blob/59e3e10b/docs/job-scheduling.md -- diff --git a/docs/job-scheduling.md b/docs/job-scheduling.md index 95d4779..00b6a18 100644 --- a/docs/job-scheduling.md +++ b/docs/job-scheduling.md @@ -54,8 +54,7 @@ an application to gain back cores on one node when it has work to do. To use thi Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying -the same RDDs. In future releases, in-memory storage systems such as [Tachyon](http://tachyon-project.org) will -provide another approach to share RDDs. +the same RDDs. 
## Dynamic Resource Allocation http://git-wip-us.apache.org/repos/asf/spark/blob/59e3e10b/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index 5ebafa4..2f0ed5e 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -1177,7 +1177,7 @@ that originally created it. In addition, each persisted RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), -replicate it across nodes, or store it off-heap in [Tachyon](http://tachyon-project.org/). +replicate it across nodes. These levels are set by passing a `StorageLevel` object ([Scala](api/scala/index.html#org.apache.spark.storage.StorageLevel), [Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html), @@ -1218,24 +1218,11 @@
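The programming-guide edit above only drops the off-heap/Tachyon wording; persisting RDDs by passing a `StorageLevel` is unchanged. A minimal sketch of the remaining usage (illustrative, not from the patch), assuming an existing `SparkContext` named `sc`:

```
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// Persist in memory and on disk as serialized Java objects, replicated to two nodes.
// These levels still work; only the external (Tachyon) block store wiring was removed.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

println(rdd.count())
```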
[2/2] spark git commit: Preparing development version 1.6.2-SNAPSHOT
Preparing development version 1.6.2-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dcf60d79 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dcf60d79 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dcf60d79 Branch: refs/heads/branch-1.6 Commit: dcf60d79e2c46b7bafb67020e89623b060ab462b Parents: 15de51c Author: Patrick Wendell Authored: Fri Feb 26 20:09:11 2016 -0800 Committer: Patrick Wendell Committed: Fri Feb 26 20:09:11 2016 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 87559fd..ff34522 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 1050217..e31cf2f 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 1399103..4f909ba 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 57e64df..a136781 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index a713135..ed97dc2 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index cbc64bd..0d6b563 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1 +1.6.2-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/dcf60d79/external/flume-sink/pom.xml -- diff --git 
a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml index 26a1456
[1/2] spark git commit: Preparing Spark release v1.6.1-rc1
Repository: spark Updated Branches: refs/heads/branch-1.6 eb6f6da48 -> dcf60d79e Preparing Spark release v1.6.1-rc1 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/15de51c2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/15de51c2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/15de51c2 Branch: refs/heads/branch-1.6 Commit: 15de51c238a7340fa81cb0b80d029a05d97bfc5c Parents: eb6f6da Author: Patrick Wendell Authored: Fri Feb 26 20:09:04 2016 -0800 Committer: Patrick Wendell Committed: Fri Feb 26 20:09:04 2016 -0800 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- docker-integration-tests/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tags/pom.xml| 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 35 files changed, 35 insertions(+), 35 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 2087e43..87559fd 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 4b4fc24..1050217 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index e1b2c1c..1399103 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/docker-integration-tests/pom.xml -- diff --git a/docker-integration-tests/pom.xml b/docker-integration-tests/pom.xml index 14b6eac..57e64df 100644 --- a/docker-integration-tests/pom.xml +++ b/docker-integration-tests/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 90cf2b8..a713135 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index b78f3d4..cbc64bd 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.10 -1.6.1-SNAPSHOT +1.6.1 ../../pom.xml 
http://git-wip-us.apache.org/repos/asf/spark/blob/15de51c2/external/flume-sink/pom.xml -- diff --gi
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.1-rc1 [created] 15de51c23

spark git commit: Update CHANGES.txt and spark-ec2 and R package versions for 1.6.1
Repository: spark Updated Branches: refs/heads/branch-1.6 8a43c3bfb -> eb6f6da48 Update CHANGES.txt and spark-ec2 and R package versions for 1.6.1 This patch updates a few more 1.6.0 version numbers for the 1.6.1 release candidate. Verified this by running ``` git grep "1\.6\.0" | grep -v since | grep -v deprecated | grep -v Since | grep -v versionadded | grep 1.6.0 ``` and inspecting the output. Author: Josh Rosen Closes #11407 from JoshRosen/version-number-updates. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb6f6da4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb6f6da4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb6f6da4 Branch: refs/heads/branch-1.6 Commit: eb6f6da484b4390c5b196d8426a49609b6a6fc7c Parents: 8a43c3b Author: Josh Rosen Authored: Fri Feb 26 20:05:44 2016 -0800 Committer: Josh Rosen Committed: Fri Feb 26 20:05:44 2016 -0800 -- CHANGES.txt | 788 + R/pkg/DESCRIPTION | 2 +- dev/create-release/generate-changelist.py | 4 +- ec2/spark_ec2.py | 4 +- 4 files changed, 794 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eb6f6da4/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index ff59371..f66bef9 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,6 +1,794 @@ Spark Change Log +Release 1.6.1 + + [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org + Josh Rosen + 2016-02-26 18:40:00 -0800 + Commit: 8a43c3b, github.com/apache/spark/pull/11350 + + [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore. + Yin Huai + 2016-02-26 12:34:03 -0800 + Commit: a57f87e, github.com/apache/spark/pull/11349 + + [SPARK-12874][ML] ML StringIndexer does not protect itself from column name duplication + Yu ISHIKAWA + 2016-02-25 13:21:33 -0800 + Commit: abe8f99, github.com/apache/spark/pull/11370 + + Revert "[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames" + Xiangrui Meng + 2016-02-25 12:28:03 -0800 + Commit: d59a08f + + [SPARK-12316] Wait a minutes to avoid cycle calling. + huangzhaowei + 2016-02-25 09:14:19 -0600 + Commit: 5f7440b2, github.com/apache/spark/pull/10475 + + [SPARK-13439][MESOS] Document that spark.mesos.uris is comma-separated + Michael Gummelt + 2016-02-25 13:32:09 + + Commit: e3802a7, github.com/apache/spark/pull/11311 + + [SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method + Terence Yim + 2016-02-25 13:29:30 + + Commit: 1f03163, github.com/apache/spark/pull/11337 + + [SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames + Oliver Pierson , Oliver Pierson + 2016-02-25 13:24:46 + + Commit: cb869a1, github.com/apache/spark/pull/11319 + + [SPARK-13473][SQL] Don't push predicate through project with nondeterministic field(s) + Cheng Lian + 2016-02-25 20:43:03 +0800 + Commit: 3cc938a, github.com/apache/spark/pull/11348 + + [SPARK-13482][MINOR][CONFIGURATION] Make consistency of the configuraiton named in TransportConf. 
+ huangzhaowei + 2016-02-24 23:52:17 -0800 + Commit: 8975996, github.com/apache/spark/pull/11360 + + [SPARK-13475][TESTS][SQL] HiveCompatibilitySuite should still run in PR builder even if a PR only changes sql/core + Yin Huai + 2016-02-24 13:34:53 -0800 + Commit: fe71cab, github.com/apache/spark/pull/11351 + + [SPARK-13390][SQL][BRANCH-1.6] Fix the issue that Iterator.map().toSeq is not Serializable + Shixiong Zhu + 2016-02-24 13:35:36 + + Commit: 06f4fce, github.com/apache/spark/pull/11334 + + [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. + Franklyn D'souza + 2016-02-23 15:34:04 -0800 + Commit: 573a2c9, github.com/apache/spark/pull/11333 + + [SPARK-13355][MLLIB] replace GraphImpl.fromExistingRDDs by Graph.apply + Xiangrui Meng + 2016-02-22 23:54:21 -0800 + Commit: 0784e02, github.com/apache/spark/pull/11226 + + [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6 + Earthson Lu + 2016-02-22 23:40:36 -0800 + Commit: d31854d, github.com/apache/spark/pull/11237 + + Preparing development version 1.6.1-SNAPSHOT + Patrick Wendell + 2016-02-22 18:30:30 -0800 + Commit: 2902798 + + Preparing Spark release v1.6.1-rc1 + Patrick Wendell + 2016-02-22 18:30:24 -0800 + Commit: 152252f + + Update branch-1.6 for 1.6.1 release + Michael Armbrust + 2016-02-22 18:25:48 -0800 + Commit: 40d11d0 + + [SPARK-11624][SPARK-11972][SQL] fix commands that need hive to exec + Daoyuan Wang + 2016-02-22 18:13:32 -080
spark git commit: [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org
Repository: spark Updated Branches: refs/heads/master ad615291f -> f77dc4e1e [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync. Author: Josh Rosen Closes #11350 from JoshRosen/update-release-scripts-for-apache-home. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f77dc4e1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f77dc4e1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f77dc4e1 Branch: refs/heads/master Commit: f77dc4e1e202942aa8393fb5d8f492863973fe17 Parents: ad61529 Author: Josh Rosen Authored: Fri Feb 26 18:40:00 2016 -0800 Committer: Josh Rosen Committed: Fri Feb 26 18:40:00 2016 -0800 -- dev/create-release/release-build.sh | 60 +++- 1 file changed, 44 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f77dc4e1/dev/create-release/release-build.sh -- diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh index 2fd7fcc..c08b6d7 100755 --- a/dev/create-release/release-build.sh +++ b/dev/create-release/release-build.sh @@ -23,8 +23,8 @@ usage: release-build.sh Creates build deliverables from a Spark commit. Top level targets are - package: Create binary packages and copy them to people.apache - docs: Build docs and copy them to people.apache + package: Create binary packages and copy them to home.apache + docs: Build docs and copy them to home.apache publish-snapshot: Publish snapshot release to Apache snapshots publish-release: Publish a release to Apache release repo @@ -64,13 +64,16 @@ for env in ASF_USERNAME ASF_RSA_KEY GPG_PASSPHRASE GPG_KEY; do fi done +# Explicitly set locale in order to make `sort` output consistent across machines. +# See https://stackoverflow.com/questions/28881 for more details. +export LC_ALL=C + # Commit ref to checkout when building GIT_REF=${GIT_REF:-master} # Destination directory parent on remote server REMOTE_PARENT_DIR=${REMOTE_PARENT_DIR:-/home/$ASF_USERNAME/public_html} -SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" GPG="gpg --no-tty --batch" NEXUS_ROOT=https://repository.apache.org/service/local/staging NEXUS_PROFILE=d63f592e7eac0 # Profile for Spark staging uploads @@ -97,7 +100,20 @@ if [ -z "$SPARK_PACKAGE_VERSION" ]; then fi DEST_DIR_NAME="spark-$SPARK_PACKAGE_VERSION" -USER_HOST="$asf_usern...@people.apache.org" + +function LFTP { + SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" + COMMANDS=$(cat
spark git commit: [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org
Repository: spark Updated Branches: refs/heads/branch-1.6 a57f87ee4 -> 8a43c3bfb [SPARK-13474][PROJECT INFRA] Update packaging scripts to push artifacts to home.apache.org Due to the people.apache.org -> home.apache.org migration, we need to update our packaging scripts to publish artifacts to the new server. Because the new server only supports sftp instead of ssh, we need to update the scripts to use lftp instead of ssh + rsync. Author: Josh Rosen Closes #11350 from JoshRosen/update-release-scripts-for-apache-home. (cherry picked from commit f77dc4e1e202942aa8393fb5d8f492863973fe17) Signed-off-by: Josh Rosen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a43c3bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a43c3bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a43c3bf Branch: refs/heads/branch-1.6 Commit: 8a43c3bfbcd9d6e3876e09363dba604dc7e63dc3 Parents: a57f87e Author: Josh Rosen Authored: Fri Feb 26 18:40:00 2016 -0800 Committer: Josh Rosen Committed: Fri Feb 26 18:40:23 2016 -0800 -- dev/create-release/release-build.sh | 60 +++- 1 file changed, 44 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8a43c3bf/dev/create-release/release-build.sh -- diff --git a/dev/create-release/release-build.sh b/dev/create-release/release-build.sh index cb79e9e..2c3af6a 100755 --- a/dev/create-release/release-build.sh +++ b/dev/create-release/release-build.sh @@ -23,8 +23,8 @@ usage: release-build.sh Creates build deliverables from a Spark commit. Top level targets are - package: Create binary packages and copy them to people.apache - docs: Build docs and copy them to people.apache + package: Create binary packages and copy them to home.apache + docs: Build docs and copy them to home.apache publish-snapshot: Publish snapshot release to Apache snapshots publish-release: Publish a release to Apache release repo @@ -64,13 +64,16 @@ for env in ASF_USERNAME ASF_RSA_KEY GPG_PASSPHRASE GPG_KEY; do fi done +# Explicitly set locale in order to make `sort` output consistent across machines. +# See https://stackoverflow.com/questions/28881 for more details. +export LC_ALL=C + # Commit ref to checkout when building GIT_REF=${GIT_REF:-master} # Destination directory parent on remote server REMOTE_PARENT_DIR=${REMOTE_PARENT_DIR:-/home/$ASF_USERNAME/public_html} -SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" GPG="gpg --no-tty --batch" NEXUS_ROOT=https://repository.apache.org/service/local/staging NEXUS_PROFILE=d63f592e7eac0 # Profile for Spark staging uploads @@ -97,7 +100,20 @@ if [ -z "$SPARK_PACKAGE_VERSION" ]; then fi DEST_DIR_NAME="spark-$SPARK_PACKAGE_VERSION" -USER_HOST="$asf_usern...@people.apache.org" + +function LFTP { + SSH="ssh -o ConnectTimeout=300 -o StrictHostKeyChecking=no -i $ASF_RSA_KEY" + COMMANDS=$(cat
spark git commit: [SPARK-13519][CORE] Driver should tell Executor to stop itself when cleaning executor's state
Repository: spark Updated Branches: refs/heads/master 1e5fcdf96 -> ad615291f [SPARK-13519][CORE] Driver should tell Executor to stop itself when cleaning executor's state ## What changes were proposed in this pull request? When the driver removes an executor's state, the connection between the driver and the executor may be still alive so that the executor cannot exit automatically (E.g., Master will send RemoveExecutor when a work is lost but the executor is still alive), so the driver should try to tell the executor to stop itself. Otherwise, we will leak an executor. This PR modified the driver to send `StopExecutor` to the executor when it's removed. ## How was this patch tested? manual test: increase the worker heartbeat interval to force it's always timeout and the leak executors are gone. Author: Shixiong Zhu Closes #11399 from zsxwing/SPARK-13519. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad615291 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad615291 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad615291 Branch: refs/heads/master Commit: ad615291fe76580ee59e3f48f4efe4627a01409d Parents: 1e5fcdf Author: Shixiong Zhu Authored: Fri Feb 26 15:11:57 2016 -0800 Committer: Andrew Or Committed: Fri Feb 26 15:11:57 2016 -0800 -- .../spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala | 4 1 file changed, 4 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad615291/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala index 0a5b09d..d151de5 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala @@ -179,6 +179,10 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp context.reply(true) case RemoveExecutor(executorId, reason) => +// We will remove the executor's state and cannot restore it. However, the connection +// between the driver and the executor may be still alive so that the executor won't exit +// automatically, so try to tell the executor to stop itself. See SPARK-13519. + executorDataMap.get(executorId).foreach(_.executorEndpoint.send(StopExecutor)) removeExecutor(executorId, reason) context.reply(true) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-13505][ML] add python api for MaxAbsScaler
Repository: spark Updated Branches: refs/heads/master 391755dc6 -> 1e5fcdf96 [SPARK-13505][ML] add python api for MaxAbsScaler ## What changes were proposed in this pull request? After SPARK-13028, we should add Python API for MaxAbsScaler. ## How was this patch tested? unit test Author: zlpmichelle Closes #11393 from zlpmichelle/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e5fcdf9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e5fcdf9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e5fcdf9 Branch: refs/heads/master Commit: 1e5fcdf96c0176a11e5f425ba539b6ed629281db Parents: 391755d Author: zlpmichelle Authored: Fri Feb 26 14:37:44 2016 -0800 Committer: Xiangrui Meng Committed: Fri Feb 26 14:37:44 2016 -0800 -- python/pyspark/ml/feature.py | 75 +++ 1 file changed, 68 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1e5fcdf9/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 67bccfa..369f350 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -28,13 +28,14 @@ from pyspark.mllib.common import inherit_doc from pyspark.mllib.linalg import _convert_to_vector __all__ = ['Binarizer', 'Bucketizer', 'CountVectorizer', 'CountVectorizerModel', 'DCT', - 'ElementwiseProduct', 'HashingTF', 'IDF', 'IDFModel', 'IndexToString', 'MinMaxScaler', - 'MinMaxScalerModel', 'NGram', 'Normalizer', 'OneHotEncoder', 'PCA', 'PCAModel', - 'PolynomialExpansion', 'QuantileDiscretizer', 'RegexTokenizer', 'RFormula', - 'RFormulaModel', 'SQLTransformer', 'StandardScaler', 'StandardScalerModel', - 'StopWordsRemover', 'StringIndexer', 'StringIndexerModel', 'Tokenizer', - 'VectorAssembler', 'VectorIndexer', 'VectorSlicer', 'Word2Vec', 'Word2VecModel', - 'ChiSqSelector', 'ChiSqSelectorModel'] + 'ElementwiseProduct', 'HashingTF', 'IDF', 'IDFModel', 'IndexToString', + 'MaxAbsScaler', 'MaxAbsScalerModel', 'MinMaxScaler', 'MinMaxScalerModel', + 'NGram', 'Normalizer', 'OneHotEncoder', 'PCA', 'PCAModel', 'PolynomialExpansion', + 'QuantileDiscretizer', 'RegexTokenizer', 'RFormula', 'RFormulaModel', + 'SQLTransformer', 'StandardScaler', 'StandardScalerModel', 'StopWordsRemover', + 'StringIndexer', 'StringIndexerModel', 'Tokenizer', 'VectorAssembler', + 'VectorIndexer', 'VectorSlicer', 'Word2Vec', 'Word2VecModel', 'ChiSqSelector', + 'ChiSqSelectorModel'] @inherit_doc @@ -545,6 +546,66 @@ class IDFModel(JavaModel): @inherit_doc +class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol): +""" +.. note:: Experimental + +Rescale each feature individually to range [-1, 1] by dividing through the largest maximum +absolute value in each feature. It does not shift/center the data, and thus does not destroy +any sparsity. + +>>> from pyspark.mllib.linalg import Vectors +>>> df = sqlContext.createDataFrame([(Vectors.dense([1.0]),), (Vectors.dense([2.0]),)], ["a"]) +>>> maScaler = MaxAbsScaler(inputCol="a", outputCol="scaled") +>>> model = maScaler.fit(df) +>>> model.transform(df).show() ++-+--+ +|a|scaled| ++-+--+ +|[1.0]| [0.5]| +|[2.0]| [1.0]| ++-+--+ +... + +.. 
versionadded:: 2.0.0 +""" + +@keyword_only +def __init__(self, inputCol=None, outputCol=None): +""" +__init__(self, inputCol=None, outputCol=None) +""" +super(MaxAbsScaler, self).__init__() +self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.MaxAbsScaler", self.uid) +self._setDefault() +kwargs = self.__init__._input_kwargs +self.setParams(**kwargs) + +@keyword_only +@since("2.0.0") +def setParams(self, inputCol=None, outputCol=None): +""" +setParams(self, inputCol=None, outputCol=None) +Sets params for this MaxAbsScaler. +""" +kwargs = self.setParams._input_kwargs +return self._set(**kwargs) + +def _create_model(self, java_model): +return MaxAbsScalerModel(java_model) + + +class MaxAbsScalerModel(JavaModel): +""" +.. note:: Experimental + +Model fitted by :py:class:`MaxAbsScaler`. + +.. versionadded:: 2.0.0 +""" + + +@inherit_doc class MinMaxScaler(JavaEstimator, HasInputCol, HasOutputCol): """ .. note:: Experimental - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional
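The new Python class wraps the Scala `org.apache.spark.ml.feature.MaxAbsScaler` introduced by SPARK-13028. For comparison with the doctest above, a minimal Scala sketch of the same flow (illustrative, assuming an existing `SQLContext` named `sqlContext`):

```
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.mllib.linalg.Vectors

// Two single-feature rows, mirroring the Python doctest.
val df = sqlContext.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0)),
  Tuple1(Vectors.dense(2.0))
)).toDF("a")

val scaler = new MaxAbsScaler().setInputCol("a").setOutputCol("scaled")
val model = scaler.fit(df)

// Each feature is divided by its maximum absolute value: 1.0 -> 0.5, 2.0 -> 1.0.
model.transform(df).show()
```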
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.6.1-rc1 [deleted] 152252f15
spark git commit: [SPARK-13465] Add a task failure listener to TaskContext
Repository: spark Updated Branches: refs/heads/master 0598a2b81 -> 391755dc6 [SPARK-13465] Add a task failure listener to TaskContext ## What changes were proposed in this pull request? TaskContext supports task completion callback, which gets called regardless of task failures. However, there is no way for the listener to know if there is an error. This patch adds a new listener that gets called when a task fails. ## How was the this patch tested? New unit test case and integration test case covering the code path Author: Reynold Xin Closes #11340 from rxin/SPARK-13465. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/391755dc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/391755dc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/391755dc Branch: refs/heads/master Commit: 391755dc6ed2e156b8df8a530ac8df6ed7ba7f8a Parents: 0598a2b Author: Reynold Xin Authored: Fri Feb 26 12:49:16 2016 -0800 Committer: Davies Liu Committed: Fri Feb 26 12:49:16 2016 -0800 -- .../scala/org/apache/spark/TaskContext.scala| 28 +++- .../org/apache/spark/TaskContextImpl.scala | 33 -- .../scala/org/apache/spark/scheduler/Task.scala | 5 ++ .../spark/util/TaskCompletionListener.scala | 33 -- .../util/TaskCompletionListenerException.scala | 34 -- .../org/apache/spark/util/taskListeners.scala | 68 .../spark/JavaTaskCompletionListenerImpl.java | 39 --- .../spark/JavaTaskContextCompileCheck.java | 30 + .../spark/scheduler/TaskContextSuite.scala | 44 - project/MimaExcludes.scala | 4 +- 10 files changed, 201 insertions(+), 117 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/391755dc/core/src/main/scala/org/apache/spark/TaskContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskContext.scala b/core/src/main/scala/org/apache/spark/TaskContext.scala index 9f49cf1..bfcacbf 100644 --- a/core/src/main/scala/org/apache/spark/TaskContext.scala +++ b/core/src/main/scala/org/apache/spark/TaskContext.scala @@ -23,7 +23,7 @@ import org.apache.spark.annotation.DeveloperApi import org.apache.spark.executor.TaskMetrics import org.apache.spark.memory.TaskMemoryManager import org.apache.spark.metrics.source.Source -import org.apache.spark.util.TaskCompletionListener +import org.apache.spark.util.{TaskCompletionListener, TaskFailureListener} object TaskContext { @@ -106,6 +106,8 @@ abstract class TaskContext extends Serializable { * Adds a (Java friendly) listener to be executed on task completion. * This will be called in all situation - success, failure, or cancellation. * An example use is for HadoopRDD to register a callback to close the input stream. + * + * Exceptions thrown by the listener will result in failure of the task. */ def addTaskCompletionListener(listener: TaskCompletionListener): TaskContext @@ -113,8 +115,30 @@ abstract class TaskContext extends Serializable { * Adds a listener in the form of a Scala closure to be executed on task completion. * This will be called in all situations - success, failure, or cancellation. * An example use is for HadoopRDD to register a callback to close the input stream. + * + * Exceptions thrown by the listener will result in failure of the task. 
*/ - def addTaskCompletionListener(f: (TaskContext) => Unit): TaskContext + def addTaskCompletionListener(f: (TaskContext) => Unit): TaskContext = { +addTaskCompletionListener(new TaskCompletionListener { + override def onTaskCompletion(context: TaskContext): Unit = f(context) +}) + } + + /** + * Adds a listener to be executed on task failure. + * Operations defined here must be idempotent, as `onTaskFailure` can be called multiple times. + */ + def addTaskFailureListener(listener: TaskFailureListener): TaskContext + + /** + * Adds a listener to be executed on task failure. + * Operations defined here must be idempotent, as `onTaskFailure` can be called multiple times. + */ + def addTaskFailureListener(f: (TaskContext, Throwable) => Unit): TaskContext = { +addTaskFailureListener(new TaskFailureListener { + override def onTaskFailure(context: TaskContext, error: Throwable): Unit = f(context, error) +}) + } /** * The ID of the stage that this task belong to. http://git-wip-us.apache.org/repos/asf/spark/blob/391755dc/core/src/main/scala/org/apache/spark/TaskContextImpl.scala -- diff --git a/core/src/main/scala/org/apache/spark/TaskContextImpl.scala b/core/src/main/scala/org/apache/spark/TaskContextImpl.scala inde
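The failure listener is registered the same way as the existing completion callback. A minimal usage sketch of the closure-based form added by this patch (illustrative, assuming an existing `SparkContext` named `sc`; the listeners run inside the task):

```
import org.apache.spark.TaskContext

val rdd = sc.parallelize(1 to 100, 4)

rdd.mapPartitions { iter =>
  val ctx = TaskContext.get()

  // Pre-existing API: called on success, failure, or cancellation.
  ctx.addTaskCompletionListener { _ =>
    println(s"task ${ctx.taskAttemptId()} finished")
  }

  // New in this patch: called only when the task fails, with the causing error.
  // Must be idempotent, since it can be invoked multiple times.
  ctx.addTaskFailureListener { (_, error) =>
    println(s"task ${ctx.taskAttemptId()} failed: ${error.getMessage}")
  }

  iter.map(_ * 2)
}.count()
```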
spark git commit: [SPARK-13499] [SQL] Performance improvements for parquet reader.
Repository: spark Updated Branches: refs/heads/master 6df1e55a6 -> 0598a2b81 [SPARK-13499] [SQL] Performance improvements for parquet reader. ## What changes were proposed in this pull request? This patch includes these performance fixes: - Remove unnecessary setNotNull() calls. The NULL bits are cleared already. - Speed up RLE group decoding - Speed up dictionary decoding by decoding NULLs directly into the result. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) In addition to the updated benchmarks, on TPCDS, the result of these changes running Q55 (sf40) is: ``` TPCDS: Best/Avg Time(ms)Rate(M/s) Per Row(ns) - q55 (Before) 6398 / 6616 18.0 55.5 q55 (After) 4983 / 5189 23.1 43.3 ``` Author: Nong Li Closes #11375 from nongli/spark-13499. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0598a2b8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0598a2b8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0598a2b8 Branch: refs/heads/master Commit: 0598a2b81d1426dd2cf9e6fc32cef345364d18c6 Parents: 6df1e55 Author: Nong Li Authored: Fri Feb 26 12:43:50 2016 -0800 Committer: Davies Liu Committed: Fri Feb 26 12:43:50 2016 -0800 -- .../parquet/UnsafeRowParquetRecordReader.java | 17 + .../parquet/VectorizedRleValuesReader.java | 66 .../parquet/ParquetReadBenchmark.scala | 30 - 3 files changed, 59 insertions(+), 54 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0598a2b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java index 4576ac2..9d50cfa 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java @@ -628,7 +628,8 @@ public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBas dictionaryIds.reserve(total); } // Read and decode dictionary ids. - readIntBatch(rowId, num, dictionaryIds); + defColumn.readIntegers( + num, dictionaryIds, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn); decodeDictionaryIds(rowId, num, column); } else { switch (descriptor.getType()) { @@ -739,18 +740,6 @@ public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBas default: throw new NotImplementedException("Unsupported type: " + descriptor.getType()); } - - if (dictionaryIds.numNulls() > 0) { -// Copy the NULLs over. -// TODO: we can improve this by decoding the NULLs directly into column. This would -// mean we decode the int ids into `dictionaryIds` and the NULLs into `column` and then -// just do the ID remapping as above. 
-for (int i = 0; i < num; ++i) { - if (dictionaryIds.getIsNull(rowId + i)) { -column.putNull(rowId + i); - } -} - } } /** @@ -769,7 +758,7 @@ public class UnsafeRowParquetRecordReader extends SpecificParquetRecordReaderBas // TODO: implement remaining type conversions if (column.dataType() == DataTypes.IntegerType || column.dataType() == DataTypes.DateType) { defColumn.readIntegers( -num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn, 0); +num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn); } else if (column.dataType() == DataTypes.ByteType) { defColumn.readBytes( num, column, rowId, maxDefLevel, (VectorizedValuesReader) dataColumn); http://git-wip-us.apache.org/repos/asf/spark/blob/0598a2b8/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java index 629959a..
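The effect of these decoding changes is easiest to see by timing a plain scan with the vectorized reader on and off, in the spirit of the updated `ParquetReadBenchmark`. A rough hand-rolled sketch (not the actual benchmark harness), assuming an existing `SQLContext` named `sqlContext` and a hypothetical output path:

```
val path = "/tmp/parquet-scan-bench"  // hypothetical path

// Write some throwaway data once, then time repeated single-column scans.
sqlContext.range(0L, 10000000L)
  .selectExpr("id", "cast(id as string) as s")
  .write.mode("overwrite").parquet(path)

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "true")
time("vectorized")(sqlContext.read.parquet(path).selectExpr("sum(id)").collect())

sqlContext.setConf("spark.sql.parquet.enableVectorizedReader", "false")
time("row-at-a-time")(sqlContext.read.parquet(path).selectExpr("sum(id)").collect())
```

Timings from a single run like this are noisy; the committed benchmark averages over multiple iterations.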
spark git commit: [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore.
Repository: spark Updated Branches: refs/heads/branch-1.6 abe8f991a -> a57f87ee4 [SPARK-13454][SQL] Allow users to drop a table with a name starting with an underscore. ## What changes were proposed in this pull request? This change adds a workaround to allow users to drop a table with a name starting with an underscore. Without this patch, we can create such a table, but we cannot drop it. The reason is that Hive's parser unquote an quoted identifier (see https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g#L453). So, when we issue a drop table command to Hive, a table name starting with an underscore is actually not quoted. Then, Hive will complain about it because it does not support a table name starting with an underscore without using backticks (underscores are allowed as long as it is not the first char though). ## How was this patch tested? Add a test to make sure we can drop a table with a name starting with an underscore. https://issues.apache.org/jira/browse/SPARK-13454 Author: Yin Huai Closes #11349 from yhuai/fixDropTable. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a57f87ee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a57f87ee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a57f87ee Branch: refs/heads/branch-1.6 Commit: a57f87ee4aafdb97c15f4076e20034ea34c7e2e5 Parents: abe8f99 Author: Yin Huai Authored: Fri Feb 26 12:34:03 2016 -0800 Committer: Yin Huai Committed: Fri Feb 26 12:34:03 2016 -0800 -- .../spark/sql/hive/execution/commands.scala | 20 +-- .../sql/hive/HiveMetastoreCatalogSuite.scala| 21 2 files changed, 39 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a57f87ee/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala index 94210a5..6b16d59 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.hive.execution import org.apache.hadoop.hive.metastore.MetaStoreUtils import org.apache.spark.sql._ -import org.apache.spark.sql.catalyst.TableIdentifier +import org.apache.spark.sql.catalyst.{SqlParser, TableIdentifier} import org.apache.spark.sql.catalyst.analysis.EliminateSubQueries import org.apache.spark.sql.catalyst.expressions.Attribute import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan @@ -70,7 +70,23 @@ case class DropTable( case e: Throwable => log.warn(s"${e.getMessage}", e) } hiveContext.invalidateTable(tableName) -hiveContext.runSqlHive(s"DROP TABLE $ifExistsClause$tableName") +val tableNameForHive = { + // Hive's parser will unquote an identifier (see the rule of QuotedIdentifier in + // HiveLexer.g of Hive 1.2.1). For the DROP TABLE command that we pass in Hive, we + // will use the quoted form (db.tableName) if the table name starts with a _. + // Otherwise, we keep the unquoted form (`db`.`tableName`), which is the same as tableName + // passed into this DropTable class. Please note that although QuotedIdentifier rule + // allows backticks appearing in an identifier, Hive does not actually allow such + // an identifier be a table name. So, we do not check if a table name part has + // any backtick or not. 
+ // + // This change is at here because this patch is just for 1.6 branch and we try to + // avoid of affecting normal cases (tables do not use _ as the first character of + // their name). + val identifier = SqlParser.parseTableIdentifier(tableName) + if (identifier.table.startsWith("_")) identifier.quotedString else identifier.unquotedString +} +hiveContext.runSqlHive(s"DROP TABLE $ifExistsClause$tableNameForHive") hiveContext.catalog.unregisterTable(TableIdentifier(tableName)) Seq.empty[Row] } http://git-wip-us.apache.org/repos/asf/spark/blob/a57f87ee/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala index d63f3d3..15fd96a 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala +++ b/sql/hive/src/test/scala/or
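The scenario being fixed is easy to reproduce from SQL. A minimal sketch (illustrative table name), assuming an existing `HiveContext` named `hiveContext`:

```
// Creating a table whose name starts with an underscore has always worked,
// as long as the name is quoted with backticks.
hiveContext.sql("CREATE TABLE `_tmp_table` (key INT, value STRING)")

// Before this patch the drop failed: the name reached Hive unquoted, and Hive
// rejects identifiers that start with an underscore. With the fix, the name is
// re-quoted with backticks before being passed to Hive and the drop succeeds.
hiveContext.sql("DROP TABLE IF EXISTS `_tmp_table`")
```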
spark git commit: [SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin
Repository: spark Updated Branches: refs/heads/master 727e78014 -> 6df1e55a6 [SPARK-12313] [SQL] improve performance of BroadcastNestedLoopJoin ## What changes were proposed in this pull request? Currently, BroadcastNestedLoopJoin is implemented for the worst case; it is too slow and can easily hang forever. This PR creates fast paths for some combinations of joinType and buildSide, and also improves the worst case (which now uses much less memory than before). Before this PR, one task required O(N*K) + O(K) memory in the worst case, where N is the number of rows from one partition of the streamed table, and this could hang the job (because of GC). To work around this for InnerJoin, we had to disable auto-broadcast and switch to CartesianProduct; see https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html In this PR, we have fast paths for these joins: InnerJoin with BuildLeft or BuildRight, LeftOuterJoin with BuildRight, RightOuterJoin with BuildLeft, LeftSemi with BuildRight. These fast paths are all stream based (one pass over the streamed table) and require O(1) memory. All other join types and build sides take two passes over the streamed table: one pass to produce the matched rows (including the streamed side), which requires O(1) memory, and another pass to find the rows from the build table that have no matched row in the streamed table, which requires O(K) memory, where K is the number of rows on the build side (one bit per row, which should be much smaller than the memory used for the broadcast). The following join types work in this way: LeftOuterJoin with BuildLeft, RightOuterJoin with BuildRight, FullOuterJoin with BuildLeft or BuildRight, LeftSemi with BuildLeft. This PR also adds tests for all the join types of BroadcastNestedLoopJoin. After this PR, for InnerJoin with one small table, BroadcastNestedLoopJoin should be faster than CartesianProduct, so the workaround above is no longer needed. ## How was this patch tested? Added unit tests. Author: Davies Liu Closes #11328 from davies/nested_loop. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6df1e55a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6df1e55a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6df1e55a Branch: refs/heads/master Commit: 6df1e55a6594ae4bc7882f44af8d230aad9489b4 Parents: 727e780 Author: Davies Liu Authored: Fri Feb 26 09:58:05 2016 -0800 Committer: Davies Liu Committed: Fri Feb 26 09:58:05 2016 -0800 -- .../catalyst/plans/physical/broadcastMode.scala | 1 + .../spark/sql/execution/SparkStrategies.scala | 14 +- .../joins/BroadcastNestedLoopJoin.scala | 295 ++- .../scala/org/apache/spark/sql/JoinSuite.scala | 11 +- .../sql/execution/joins/InnerJoinSuite.scala| 27 ++ .../sql/execution/joins/OuterJoinSuite.scala| 18 ++ .../sql/execution/joins/SemiJoinSuite.scala | 20 +- 7 files changed, 295 insertions(+), 91 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6df1e55a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala index c646dcf..e01f69f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/broadcastMode.scala @@ -31,5 +31,6 @@ trait BroadcastMode { * IdentityBroadcastMode requires that rows are broadcasted in their original form. */ case object IdentityBroadcastMode extends BroadcastMode { + // TODO: pack the UnsafeRows into single bytes array. override def transform(rows: Array[InternalRow]): Array[InternalRow] = rows } http://git-wip-us.apache.org/repos/asf/spark/blob/6df1e55a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala index 5fdf38c..dd8c96d 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala @@ -253,22 +253,19 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] { object BroadcastNestedLoop extends Strategy { def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { - case logical.Join( - CanBroad
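BroadcastNestedLoopJoin is the plan Spark falls back to when a join has no equi-join keys but one side can be broadcast. A minimal sketch of a query shape that typically exercises one of the new fast paths (LeftOuter with the broadcast side on the right); the data and predicate are illustrative, and an existing `SQLContext` named `sqlContext` is assumed:

```
import org.apache.spark.sql.functions.broadcast

val events = sqlContext.range(0L, 1000000L).selectExpr("id", "id % 100 as score")
val buckets = sqlContext.range(0L, 10L).selectExpr("id * 10 as lo", "id * 10 + 9 as hi")

// A range (non-equi) join condition: no hash join applies, so with one side
// broadcast this typically plans as BroadcastNestedLoopJoin.
val joined = events.join(
  broadcast(buckets),
  events("score") >= buckets("lo") && events("score") <= buckets("hi"),
  "left_outer")

joined.explain()
println(joined.count())
```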
spark git commit: [MINOR][SQL] Fix modifier order.
Repository: spark Updated Branches: refs/heads/master 7af0de076 -> 727e78014 [MINOR][SQL] Fix modifier order. ## What changes were proposed in this pull request? This PR fixes the order of modifier from `abstract public` into `public abstract`. Currently, when we run `./dev/lint-java`, it shows the error. ``` Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/sketch/CountMinSketch.java:[53,10] (modifier) ModifierOrder: 'public' modifier out of order with the JLS suggestions. ``` ## How was this patch tested? ``` $ ./dev/lint-java Checkstyle checks passed. ``` Author: Dongjoon Hyun Closes #11390 from dongjoon-hyun/fix_modifier_order. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/727e7801 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/727e7801 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/727e7801 Branch: refs/heads/master Commit: 727e78014fd4957e477d62adc977fa4da3e3455d Parents: 7af0de0 Author: Dongjoon Hyun Authored: Fri Feb 26 17:11:19 2016 + Committer: Sean Owen Committed: Fri Feb 26 17:11:19 2016 + -- .../src/main/java/org/apache/spark/util/sketch/CountMinSketch.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/727e7801/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java -- diff --git a/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java b/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java index 2c9aa93..40fa20c 100644 --- a/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java +++ b/common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java @@ -50,7 +50,7 @@ import java.io.OutputStream; * * This implementation is largely based on the {@code CountMinSketch} class from stream-lib. */ -abstract public class CountMinSketch { +public abstract class CountMinSketch { public enum Version { /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example
Repository: spark Updated Branches: refs/heads/master b33261f91 -> 7af0de076 [SPARK-11381][DOCS] Replace example code in mllib-linear-methods.md using include_example ## What changes were proposed in this pull request? This PR replaces example codes in `mllib-linear-methods.md` using `include_example` by doing the followings: * Extracts the example codes(Scala,Java,Python) as files in `example` module. * Merges some dialog-style examples into a single file. * Hide redundant codes in HTML for the consistency with other docs. ## How was the this patch tested? manual test. This PR can be tested by document generations, `SKIP_API=1 jekyll build`. Author: Dongjoon Hyun Closes #11320 from dongjoon-hyun/SPARK-11381. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7af0de07 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7af0de07 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7af0de07 Branch: refs/heads/master Commit: 7af0de076f74e975c9235c88b0f11b22fcbae060 Parents: b33261f Author: Dongjoon Hyun Authored: Fri Feb 26 08:31:55 2016 -0800 Committer: Xiangrui Meng Committed: Fri Feb 26 08:31:55 2016 -0800 -- docs/mllib-linear-methods.md| 441 ++- .../JavaLinearRegressionWithSGDExample.java | 94 .../JavaLogisticRegressionWithLBFGSExample.java | 79 .../examples/mllib/JavaSVMWithSGDExample.java | 82 .../mllib/linear_regression_with_sgd_example.py | 54 +++ .../logistic_regression_with_lbfgs_example.py | 54 +++ .../streaming_linear_regression_example.py | 62 +++ .../main/python/mllib/svm_with_sgd_example.py | 47 ++ .../mllib/LinearRegressionWithSGDExample.scala | 64 +++ .../LogisticRegressionWithLBFGSExample.scala| 69 +++ .../examples/mllib/SVMWithSGDExample.scala | 70 +++ .../StreamingLinearRegressionExample.scala | 58 +++ 12 files changed, 758 insertions(+), 416 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7af0de07/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index aac8f75..63665c4 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -170,42 +170,7 @@ error. Refer to the [`SVMWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) and [`SVMModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMModel) for details on the API. -{% highlight scala %} -import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} -import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics -import org.apache.spark.mllib.util.MLUtils - -// Load training data in LIBSVM format. -val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt") - -// Split data into training (60%) and test (40%). -val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L) -val training = splits(0).cache() -val test = splits(1) - -// Run training algorithm to build the model -val numIterations = 100 -val model = SVMWithSGD.train(training, numIterations) - -// Clear the default threshold. -model.clearThreshold() - -// Compute raw scores on the test set. -val scoreAndLabels = test.map { point => - val score = model.predict(point.features) - (score, point.label) -} - -// Get evaluation metrics. 
-val metrics = new BinaryClassificationMetrics(scoreAndLabels) -val auROC = metrics.areaUnderROC() - -println("Area under ROC = " + auROC) - -// Save and load model -model.save(sc, "myModelPath") -val sameModel = SVMModel.load(sc, "myModelPath") -{% endhighlight %} +{% include_example scala/org/apache/spark/examples/mllib/SVMWithSGDExample.scala %} The `SVMWithSGD.train()` method by default performs L2 regularization with the regularization parameter set to 1.0. If we want to configure this algorithm, we @@ -216,6 +181,7 @@ variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations. {% highlight scala %} + import org.apache.spark.mllib.optimization.L1Updater val svmAlg = new SVMWithSGD() @@ -237,61 +203,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the [`SVMWithSGD` Java docs](api/java/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Java docs](api/java/org/apache/spark/mllib/classification/SVMModel.html) for details on the API. -{% highlight java %} -import scala.Tuple2; - -import org.apache.spark.api.java.*; -import org.apache.spark.api.java.function.Function; -import org.apache.spark.mllib.classification.*; -import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics; - -import org.apache.spark.mllib.regression.LabeledP
spark git commit: [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format
Repository: spark Updated Branches: refs/heads/master 99dfcedbf -> b33261f91 [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the tree module. closes #10601 Author: Bryan Cutler Author: vijaykiran Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b33261f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b33261f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b33261f9 Branch: refs/heads/master Commit: b33261f91387904c5aaccae40f86922c92a4e09a Parents: 99dfced Author: Bryan Cutler Authored: Fri Feb 26 08:30:32 2016 -0800 Committer: Xiangrui Meng Committed: Fri Feb 26 08:30:32 2016 -0800 -- docs/mllib-decision-tree.md | 6 +- .../apache/spark/mllib/tree/DecisionTree.scala | 172 +- .../spark/mllib/tree/GradientBoostedTrees.scala | 18 +- .../apache/spark/mllib/tree/RandomForest.scala | 69 ++-- .../mllib/tree/configuration/Strategy.scala | 13 +- python/pyspark/mllib/tree.py| 339 +++ 6 files changed, 340 insertions(+), 277 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b33261f9/docs/mllib-decision-tree.md -- diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md index a8612b6..9af4835 100644 --- a/docs/mllib-decision-tree.md +++ b/docs/mllib-decision-tree.md @@ -121,12 +121,12 @@ The parameters are listed below roughly in order of descending importance. New These parameters describe the problem you want to solve and your dataset. They should be specified and do not require tuning. -* **`algo`**: `Classification` or `Regression` +* **`algo`**: Type of decision tree, either `Classification` or `Regression`. -* **`numClasses`**: Number of classes (for `Classification` only) +* **`numClasses`**: Number of classes (for `Classification` only). * **`categoricalFeaturesInfo`**: Specifies which features are categorical and how many categorical values each of those features can take. This is given as a map from feature indices to feature arity (number of categories). Any features not in this map are treated as continuous. - * E.g., `Map(0 -> 2, 4 -> 10)` specifies that feature `0` is binary (taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 1, ..., 9}`). Note that feature indices are 0-based: features `0` and `4` are the 1st and 5th elements of an instance's feature vector. + * For example, `Map(0 -> 2, 4 -> 10)` specifies that feature `0` is binary (taking values `0` or `1`) and that feature `4` has 10 categories (values `{0, 1, ..., 9}`). Note that feature indices are 0-based: features `0` and `4` are the 1st and 5th elements of an instance's feature vector. * Note that you do not have to specify `categoricalFeaturesInfo`. The algorithm will still run and may get reasonable results. However, performance should be better if categorical features are properly designated. 
### Stopping criteria http://git-wip-us.apache.org/repos/asf/spark/blob/b33261f9/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index 51235a2..40440d5 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -38,8 +38,9 @@ import org.apache.spark.util.random.XORShiftRandom /** * A class which implements a decision tree learning algorithm for classification and regression. * It supports both continuous and categorical features. + * * @param strategy The configuration parameters for the tree algorithm which specify the type - * of algorithm (classification, regression, etc.), feature type (continuous, + * of decision tree (classification or regression), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. */ @Since("1.0.0") @@ -50,8 +51,8 @@ class DecisionTree @Since("1.0.0") (private val strategy: Strategy) /** * Method to train a decision tree model over an RDD - * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] - * @return DecisionTreeModel that can be used for prediction + * @param input Training data: RDD of [[org.ap
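Editorial note: the `categoricalFeaturesInfo` map described in the doc hunk above is easiest to see in code. The following small sketch is not taken from the patch; it simply shows how `Map(0 -> 2, 4 -> 10)` is passed to `DecisionTree.trainClassifier`, with the dataset path and remaining hyperparameters as placeholder values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

object CategoricalFeaturesInfoSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CategoricalFeaturesInfoSketch"))

    // Any RDD[LabeledPoint] works here; the bundled sample dataset is used as a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Feature 0 is binary (2 categories), feature 4 has 10 categories {0, 1, ..., 9};
    // every feature not listed in the map is treated as continuous.
    val categoricalFeaturesInfo = Map(0 -> 2, 4 -> 10)

    // Arguments: input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins.
    val model = DecisionTree.trainClassifier(data, 2, categoricalFeaturesInfo, "gini", 5, 32)

    println(model.toDebugString)
    sc.stop()
  }
}
```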
spark git commit: [SPARK-13457][SQL] Removes DataFrame RDD operations
Repository: spark Updated Branches: refs/heads/master 5c3912e5c -> 99dfcedbf [SPARK-13457][SQL] Removes DataFrame RDD operations ## What changes were proposed in this pull request? This is another attempt at PR #11323. This PR removes DataFrame RDD operations except for `foreach` and `foreachPartitions` (they are actions rather than transformations). Original calls are now replaced by calls to methods of `DataFrame.rdd`. PR #11323 was reverted because it introduced a regression: both `DataFrame.foreach` and `DataFrame.foreachPartitions` wrap underlying RDD operations with `withNewExecutionId` to track Spark jobs. But they are removed in #11323. ## How was this patch tested? No extra tests are added. Existing tests should cover this change. Author: Cheng Lian Closes #11388 from liancheng/remove-df-rdd-ops. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99dfcedb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99dfcedb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99dfcedb Branch: refs/heads/master Commit: 99dfcedbfd4c83c7b6a343456f03e8c6e29968c5 Parents: 5c3912e Author: Cheng Lian Authored: Sat Feb 27 00:28:30 2016 +0800 Committer: Cheng Lian Committed: Sat Feb 27 00:28:30 2016 +0800 -- .../spark/examples/ml/DataFrameExample.scala| 2 +- .../spark/examples/ml/DecisionTreeExample.scala | 8 +++ .../spark/examples/ml/OneVsRestExample.scala| 2 +- .../spark/examples/mllib/LDAExample.scala | 1 + .../apache/spark/examples/sql/RDDRelation.scala | 2 +- .../spark/examples/sql/hive/HiveFromSpark.scala | 2 +- .../scala/org/apache/spark/ml/Predictor.scala | 6 +++-- .../ml/classification/LogisticRegression.scala | 13 +++ .../spark/ml/clustering/BisectingKMeans.scala | 4 ++-- .../org/apache/spark/ml/clustering/KMeans.scala | 6 ++--- .../org/apache/spark/ml/clustering/LDA.scala| 1 + .../BinaryClassificationEvaluator.scala | 9  .../MulticlassClassificationEvaluator.scala | 6 ++--- .../ml/evaluation/RegressionEvaluator.scala | 3 ++- .../apache/spark/ml/feature/ChiSqSelector.scala | 2 +- .../spark/ml/feature/CountVectorizer.scala | 2 +- .../scala/org/apache/spark/ml/feature/IDF.scala | 2 +- .../apache/spark/ml/feature/MaxAbsScaler.scala | 2 +- .../apache/spark/ml/feature/MinMaxScaler.scala | 2 +- .../apache/spark/ml/feature/OneHotEncoder.scala | 2 +- .../scala/org/apache/spark/ml/feature/PCA.scala | 2 +- .../spark/ml/feature/StandardScaler.scala | 2 +- .../apache/spark/ml/feature/StringIndexer.scala | 1 + .../apache/spark/ml/feature/VectorIndexer.scala | 2 +- .../org/apache/spark/ml/feature/Word2Vec.scala | 2 +- .../apache/spark/ml/recommendation/ALS.scala| 1 + .../ml/regression/AFTSurvivalRegression.scala | 2 +- .../ml/regression/IsotonicRegression.scala | 6 ++--- .../spark/ml/regression/LinearRegression.scala | 16 - .../spark/mllib/api/python/PythonMLLibAPI.scala | 8 +++ .../spark/mllib/clustering/KMeansModel.scala| 2 +- .../spark/mllib/clustering/LDAModel.scala | 4 ++-- .../clustering/PowerIterationClustering.scala | 2 +- .../BinaryClassificationMetrics.scala | 2 +- .../mllib/evaluation/MulticlassMetrics.scala| 2 +- .../mllib/evaluation/MultilabelMetrics.scala| 4 +++- .../mllib/evaluation/RegressionMetrics.scala| 2 +- .../spark/mllib/feature/ChiSqSelector.scala | 2 +- .../org/apache/spark/mllib/fpm/FPGrowth.scala | 2 +- .../MatrixFactorizationModel.scala | 12 +- .../mllib/tree/model/DecisionTreeModel.scala| 2 +- .../mllib/tree/model/treeEnsembleModels.scala | 2 +- .../LogisticRegressionSuite.scala | 2 +- 
.../MultilayerPerceptronClassifierSuite.scala | 5 ++-- .../ml/classification/OneVsRestSuite.scala | 6 ++--- .../ml/clustering/BisectingKMeansSuite.scala| 3 ++- .../spark/ml/clustering/KMeansSuite.scala | 3 ++- .../apache/spark/ml/clustering/LDASuite.scala | 2 +- .../spark/ml/feature/OneHotEncoderSuite.scala | 4 ++-- .../spark/ml/feature/StringIndexerSuite.scala | 6 ++--- .../spark/ml/feature/VectorIndexerSuite.scala | 5 ++-- .../apache/spark/ml/feature/Word2VecSuite.scala | 8 +++ .../spark/ml/recommendation/ALSSuite.scala | 7 +++--- .../spark/ml/regression/GBTRegressorSuite.scala | 2 +- .../ml/regression/IsotonicRegressionSuite.scala | 6 ++--- .../ml/regression/LinearRegressionSuite.scala | 17 +++--- .../scala/org/apache/spark/sql/DataFrame.scala | 24 .../org/apache/spark/sql/GroupedData.scala | 1 + .../org/apache/spark/sql/api/r/SQLUtils.scala | 2 +- .../spark/sql/DataFrameAggregateSuite.scala | 2 +- .../parquet/ParquetFilterSuite.scala| 2 +- .../datasources/parquet/Par
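Editorial note: for code that previously used the removed RDD-style operations on `DataFrame`, the migration is mechanical — insert `.rdd` before the transformation. The following sketch of the before/after pattern uses a hypothetical helper and is not taken from the patch.

```scala
import org.apache.spark.sql.{DataFrame, Row}

object DataFrameRddMigrationSketch {
  // Hypothetical helper: collect the first column of the first five rows.
  def firstValues(df: DataFrame): Array[Double] = {
    // Before this change: df.map(_.getDouble(0)).take(5)
    // After this change: go through the underlying RDD explicitly.
    df.rdd.map((row: Row) => row.getDouble(0)).take(5)
  }

  // foreach and foreachPartition stay on DataFrame itself: they are actions, and they
  // wrap the underlying RDD job in withNewExecutionId for job tracking.
  def printAll(df: DataFrame): Unit = df.foreach(row => println(row))
}
```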
spark git commit: [SPARK-12523][YARN] Support long-running Spark on HBase and the Hive metastore
Repository: spark Updated Branches: refs/heads/master 318bf4115 -> 5c3912e5c [SPARK-12523][YARN] Support long-running Spark on HBase and the Hive metastore. Obtain the Hive metastore and HBase tokens, as well as the HDFS token, in AMDelegationTokenRenewer to support long-running applications of Spark on HBase or the Thrift server. Author: huangzhaowei Closes #10645 from SaintBacchus/SPARK-12523. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5c3912e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5c3912e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5c3912e5 Branch: refs/heads/master Commit: 5c3912e5c90ce659146c3056430d100604378b71 Parents: 318bf41 Author: huangzhaowei Authored: Fri Feb 26 07:32:07 2016 -0600 Committer: Tom Graves Committed: Fri Feb 26 07:32:07 2016 -0600 -- .../deploy/yarn/AMDelegationTokenRenewer.scala | 2 + .../org/apache/spark/deploy/yarn/Client.scala | 42 +--- .../spark/deploy/yarn/YarnSparkHadoopUtil.scala | 38 ++ 3 files changed, 42 insertions(+), 40 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5c3912e5/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala index 2ac9e33..70b67d2 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/AMDelegationTokenRenewer.scala @@ -172,6 +172,8 @@ private[yarn] class AMDelegationTokenRenewer( override def run(): Void = { val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + dst hadoopUtil.obtainTokensForNamenodes(nns, freshHadoopConf, tempCreds) +hadoopUtil.obtainTokenForHiveMetastore(sparkConf, freshHadoopConf, tempCreds) +hadoopUtil.obtainTokenForHBase(sparkConf, freshHadoopConf, tempCreds) null } }) http://git-wip-us.apache.org/repos/asf/spark/blob/5c3912e5/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index 530f1d7..dac3ea2 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -345,8 +345,8 @@ private[spark] class Client( // multiple times, YARN will fail to launch containers for the app with an internal // error. val distributedUris = new HashSet[String] -obtainTokenForHiveMetastore(sparkConf, hadoopConf, credentials) -obtainTokenForHBase(sparkConf, hadoopConf, credentials) +YarnSparkHadoopUtil.get.obtainTokenForHiveMetastore(sparkConf, hadoopConf, credentials) +YarnSparkHadoopUtil.get.obtainTokenForHBase(sparkConf, hadoopConf, credentials) val replication = sparkConf.getInt("spark.yarn.submit.file.replication", fs.getDefaultReplication(dst)).toShort @@ -1358,35 +1358,6 @@ object Client extends Logging { } /** - * Obtains token for the Hive metastore and adds them to the credentials. 
- */ - private def obtainTokenForHiveMetastore( - sparkConf: SparkConf, - conf: Configuration, - credentials: Credentials) { -if (shouldGetTokens(sparkConf, "hive") && UserGroupInformation.isSecurityEnabled) { - YarnSparkHadoopUtil.get.obtainTokenForHiveMetastore(conf).foreach { -credentials.addToken(new Text("hive.server2.delegation.token"), _) - } -} - } - - /** - * Obtain a security token for HBase. - */ - def obtainTokenForHBase( - sparkConf: SparkConf, - conf: Configuration, - credentials: Credentials): Unit = { -if (shouldGetTokens(sparkConf, "hbase") && UserGroupInformation.isSecurityEnabled) { - YarnSparkHadoopUtil.get.obtainTokenForHBase(conf).foreach { token => -credentials.addToken(token.getService, token) -logInfo("Added HBase security token to credentials.") - } -} - } - - /** * Return whether the two file systems are the same. */ private def compareFs(srcFs: FileSystem, destFs: FileSystem): Boolean = { @@ -1450,13 +1421,4 @@ object Client extends Logging { components.mkString(Path.SEPARATOR) } - /** - * Return whether delegation tokens should be retrieved for the given service when security is - * enabled. By default, tokens are retrieved, but that behavior can be changed by setting - * a service-specific configuratio
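Editorial note: taken together, the hunks above move the token-fetching helpers onto `YarnSparkHadoopUtil` so the AM-side renewer can refresh all three credential types on every cycle. Below is a simplified, illustrative sketch of that renewal pattern; it is Spark-internal code (these helpers are not a public API), and the scheduling and token-propagation logic in `AMDelegationTokenRenewer` is omitted.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf
import org.apache.spark.deploy.yarn.YarnSparkHadoopUtil

object DelegationTokenRefreshSketch {
  // One renewal pass: gather fresh HDFS, Hive metastore, and HBase tokens into a
  // single Credentials object that is later written out for executors to pick up.
  def obtainAllTokens(sparkConf: SparkConf, hadoopConf: Configuration): Credentials = {
    val creds = new Credentials()
    val hadoopUtil = YarnSparkHadoopUtil.get
    val nns = hadoopUtil.getNameNodesToAccess(sparkConf)
    hadoopUtil.obtainTokensForNamenodes(nns, hadoopConf, creds)
    hadoopUtil.obtainTokenForHiveMetastore(sparkConf, hadoopConf, creds)
    hadoopUtil.obtainTokenForHBase(sparkConf, hadoopConf, creds)
    creds
  }
}
```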
spark git commit: [MINOR][STREAMING] Fix a minor naming issue in JavaDStreamLike
Repository: spark Updated Branches: refs/heads/master 9812a24aa -> 318bf4115 [MINOR][STREAMING] Fix a minor naming issue in JavaDStreamLike Author: Liwei Lin Closes #11385 from proflin/Fix-minor-naming. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/318bf411 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/318bf411 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/318bf411 Branch: refs/heads/master Commit: 318bf41158a670e9d62123ea0cb27a833affae24 Parents: 9812a24 Author: Liwei Lin Authored: Fri Feb 26 00:21:53 2016 -0800 Committer: Reynold Xin Committed: Fri Feb 26 00:21:53 2016 -0800 -- .../org/apache/spark/streaming/api/java/JavaDStreamLike.scala | 2 ++ 1 file changed, 2 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/318bf411/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala b/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala index 65aab2f..43632f3 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala @@ -50,6 +50,8 @@ trait JavaDStreamLike[T, This <: JavaDStreamLike[T, This, R], R <: JavaRDDLike[T def wrapRDD(in: RDD[T]): R + // This is just unfortunate we made a mistake in naming -- should be scalaLongToJavaLong. + // Don't fix this for now as it would break binary compatibility. implicit def scalaIntToJavaLong(in: DStream[Long]): JavaDStream[jl.Long] = { in.map(jl.Long.valueOf) } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org