spark git commit: [SPARK-22306][SQL] alter table schema should not erase the bucketing metadata at hive side

2017-11-02 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 882079f5c -> 2fd12af43


[SPARK-22306][SQL] alter table schema should not erase the bucketing metadata 
at hive side

Forward-port of https://github.com/apache/spark/pull/19622 to the master branch.

This bug doesn't exist in master because we've added hive bucketing support and 
the hive bucketing metadata can be recognized by Spark, but we should still 
port it to master: 1) other unsupported hive metadata may also be removed by 
Spark; 2) it reduces the code difference between master and 2.2 and eases 
future backports.

***

When we alter table schema, we set the new schema to spark `CatalogTable`, 
convert it to hive table, and finally call `hive.alterTable`. This causes a 
problem in Spark 2.2, because hive bucketing metadata is not recognized by 
Spark, which means a Spark `CatalogTable` representing a hive table is always 
non-bucketed, and when we convert it to hive table and call `hive.alterTable`, 
the original hive bucketing metadata will be removed.

To fix this bug, we should read out the raw hive table metadata, update its 
schema, and call `hive.alterTable`. By doing this we can guarantee only the 
schema is changed, and nothing else.
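
A minimal, self-contained sketch of the idea (hypothetical types, not the actual 
HiveClientImpl code): only the column list of the raw Hive-side metadata is 
replaced, so bucketing information, and any other Hive-only metadata Spark 2.2 
cannot represent, survives the round trip.

// Hypothetical types for illustration; not Spark's actual classes.
case class HiveColumn(name: String, dataType: String)
case class RawHiveTable(
    columns: Seq[HiveColumn],          // data columns
    bucketColumns: Seq[String],        // Hive-only metadata Spark 2.2 cannot represent
    numBuckets: Int,
    properties: Map[String, String])

// The fix: replace only the data columns on the raw Hive metadata and write that
// back, instead of regenerating the whole Hive table from a Spark CatalogTable
// (which would silently drop bucketColumns/numBuckets).
def alterDataSchemaOnly(raw: RawHiveTable, newDataSchema: Seq[HiveColumn]): RawHiveTable =
  raw.copy(columns = newDataSchema)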

Author: Wenchen Fan 

Closes #19644 from cloud-fan/infer.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2fd12af4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2fd12af4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2fd12af4

Branch: refs/heads/master
Commit: 2fd12af4372a1e2c3faf0eb5d0a1cf530abc0016
Parents: 882079f
Author: Wenchen Fan 
Authored: Thu Nov 2 23:41:16 2017 +0100
Committer: Wenchen Fan 
Committed: Thu Nov 2 23:41:16 2017 +0100

--
 .../sql/catalyst/catalog/ExternalCatalog.scala  | 12 +++--
 .../sql/catalyst/catalog/InMemoryCatalog.scala  |  7 +--
 .../sql/catalyst/catalog/SessionCatalog.scala   | 23 +-
 .../catalyst/catalog/ExternalCatalogSuite.scala | 10 ++---
 .../catalyst/catalog/SessionCatalogSuite.scala  |  8 ++--
 .../spark/sql/execution/command/ddl.scala   | 14 +++---
 .../spark/sql/execution/command/tables.scala| 14 +++---
 .../datasources/DataSourceStrategy.scala|  4 +-
 .../datasources/orc/OrcFileFormat.scala |  5 +--
 .../datasources/parquet/ParquetFileFormat.scala |  5 +--
 .../parquet/ParquetSchemaConverter.scala|  5 +--
 .../spark/sql/hive/HiveExternalCatalog.scala| 47 
 .../spark/sql/hive/HiveMetastoreCatalog.scala   | 11 +++--
 .../apache/spark/sql/hive/HiveStrategies.scala  |  4 +-
 .../spark/sql/hive/client/HiveClient.scala  | 11 +
 .../spark/sql/hive/client/HiveClientImpl.scala  | 45 ---
 .../sql/hive/HiveExternalCatalogSuite.scala | 18 
 .../sql/hive/MetastoreDataSourcesSuite.scala|  4 +-
 .../sql/hive/execution/Hive_2_1_DDLSuite.scala  |  2 +-
 19 files changed, 146 insertions(+), 103 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2fd12af4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
index d4c58db..223094d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
@@ -150,17 +150,15 @@ abstract class ExternalCatalog
   def alterTable(tableDefinition: CatalogTable): Unit
 
   /**
-   * Alter the schema of a table identified by the provided database and table 
name. The new schema
-   * should still contain the existing bucket columns and partition columns 
used by the table. This
-   * method will also update any Spark SQL-related parameters stored as Hive 
table properties (such
-   * as the schema itself).
+   * Alter the data schema of a table identified by the provided database and 
table name. The new
+   * data schema should not have conflict column names with the existing 
partition columns, and
+   * should still contain all the existing data columns.
*
* @param db Database that table to alter schema for exists in
* @param table Name of table to alter schema for
-   * @param schema Updated schema to be used for the table (must contain 
existing partition and
-   *   bucket columns)
+   * @param newDataSchema Updated data schema to be used for the table.
*/
-  def alterTableSchema(db: String, table: String, schema: StructType): Unit
+  def alterTableDataSchema(db: String, table: String, newDataSchema: 
StructType): Unit

[spark] Git Push Summary

2017-11-02 Thread holden
Repository: spark
Updated Tags:  refs/tags/v2.1.2 [created] 2abaea9e4

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22243][DSTREAM] spark.yarn.jars should reload from config when checkpoint recovery

2017-11-02 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master e3f67a97f -> 882079f5c


[SPARK-22243][DSTREAM] spark.yarn.jars should reload from config when 
checkpoint recovery

## What changes were proposed in this pull request?
The previous [PR](https://github.com/apache/spark/pull/19469) was deleted by 
mistake. The solution is straightforward: add "spark.yarn.jars" to 
propertiesToReload so this property is reloaded from the config on recovery (a 
minimal sketch of the pattern follows below).
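
A minimal sketch of the reload pattern, assuming nothing about the real 
Checkpoint internals beyond what is shown in the diff below: the checkpointed 
configuration is restored first, then every key listed in propertiesToReload is 
overwritten from the freshly loaded config, so values such as spark.yarn.jars 
track the submitting environment rather than the checkpoint.

// Illustrative only; not the actual org.apache.spark.streaming.Checkpoint code.
val propertiesToReload = Seq("spark.yarn.jars", "spark.yarn.keytab", "spark.master")

def reloadedConf(
    checkpointed: Map[String, String],
    current: Map[String, String]): Map[String, String] =
  propertiesToReload.foldLeft(checkpointed) { (conf, key) =>
    // Only overwrite if the current config actually defines the key.
    current.get(key).map(value => conf.updated(key, value)).getOrElse(conf)
  }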

## How was this patch tested?

manual tests

Author: ZouChenjun 

Closes #19637 from ChenjunZou/checkpoint-yarn-jars.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/882079f5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/882079f5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/882079f5

Branch: refs/heads/master
Commit: 882079f5c6de40913a27873f2fa3306e5d827393
Parents: e3f67a9
Author: ZouChenjun 
Authored: Thu Nov 2 11:06:37 2017 -0700
Committer: Shixiong Zhu 
Committed: Thu Nov 2 11:06:37 2017 -0700

--
 .../src/main/scala/org/apache/spark/streaming/Checkpoint.scala  | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/882079f5/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala 
b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
index 40a0b8e..9ebb91b 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
@@ -53,6 +53,7 @@ class Checkpoint(ssc: StreamingContext, val checkpointTime: 
Time)
   "spark.driver.host",
   "spark.driver.port",
   "spark.master",
+  "spark.yarn.jars",
   "spark.yarn.keytab",
   "spark.yarn.principal",
   "spark.yarn.credentials.file",


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22416][SQL] Move OrcOptions from `sql/hive` to `sql/core`

2017-11-02 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 41b60125b -> e3f67a97f


[SPARK-22416][SQL] Move OrcOptions from `sql/hive` to `sql/core`

## What changes were proposed in this pull request?

According to the 
[discussion](https://github.com/apache/spark/pull/19571#issuecomment-339472976) 
on SPARK-15474, we will add a new OrcFileFormat in the `sql/core` module and 
allow users to use both the old and the new OrcFileFormat.

To do that, `OrcOptions` should be visible in `sql/core` module, too. 
Previously, it was `private[orc]` in `sql/hive`. This PR removes `private[orc]` 
because we don't use `private[sql]` in `sql/execution` package after 
[SPARK-16964](https://github.com/apache/spark/pull/14554).
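
A brief usage sketch based on the class body shown in the diff below (illustrative 
only; `new SQLConf` is used here just to pick up default values): the `compression` 
option takes precedence over `orc.compress`, which takes precedence over 
spark.sql.orc.compression.codec.

import org.apache.spark.sql.execution.datasources.orc.OrcOptions
import org.apache.spark.sql.internal.SQLConf

val sqlConf = new SQLConf  // defaults; in practice this comes from the session
val opts = new OrcOptions(Map("compression" -> "snappy", "orc.compress" -> "ZLIB"), sqlConf)
assert(opts.compressionCodec == "SNAPPY")  // "compression" wins over "orc.compress"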

## How was this patch tested?

Pass the Jenkins with the existing tests.

Author: Dongjoon Hyun 

Closes #19636 from dongjoon-hyun/SPARK-22416.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3f67a97
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3f67a97
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3f67a97

Branch: refs/heads/master
Commit: e3f67a97f126abfb7eeb864f657bfc9221bb195e
Parents: 41b6012
Author: Dongjoon Hyun 
Authored: Thu Nov 2 18:28:56 2017 +0100
Committer: Wenchen Fan 
Committed: Thu Nov 2 18:28:56 2017 +0100

--
 .../execution/datasources/orc/OrcOptions.scala  | 70 
 .../spark/sql/hive/orc/OrcFileFormat.scala  |  1 +
 .../apache/spark/sql/hive/orc/OrcOptions.scala  | 70 
 .../spark/sql/hive/orc/OrcSourceSuite.scala |  1 +
 4 files changed, 72 insertions(+), 70 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e3f67a97/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
new file mode 100644
index 000..c866dd8
--- /dev/null
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc
+
+import java.util.Locale
+
+import org.apache.orc.OrcConf.COMPRESS
+
+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Options for the ORC data source.
+ */
+class OrcOptions(
+@transient private val parameters: CaseInsensitiveMap[String],
+@transient private val sqlConf: SQLConf)
+  extends Serializable {
+
+  import OrcOptions._
+
+  def this(parameters: Map[String, String], sqlConf: SQLConf) =
+this(CaseInsensitiveMap(parameters), sqlConf)
+
+  /**
+   * Compression codec to use.
+   * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
+   */
+  val compressionCodec: String = {
+// `compression`, `orc.compress`(i.e., OrcConf.COMPRESS), and 
`spark.sql.orc.compression.codec`
+// are in order of precedence from highest to lowest.
+val orcCompressionConf = parameters.get(COMPRESS.getAttribute)
+val codecName = parameters
+  .get("compression")
+  .orElse(orcCompressionConf)
+  .getOrElse(sqlConf.orcCompressionCodec)
+  .toLowerCase(Locale.ROOT)
+if (!shortOrcCompressionCodecNames.contains(codecName)) {
+  val availableCodecs = 
shortOrcCompressionCodecNames.keys.map(_.toLowerCase(Locale.ROOT))
+  throw new IllegalArgumentException(s"Codec [$codecName] " +
+s"is not available. Available codecs are ${availableCodecs.mkString(", 
")}.")
+}
+shortOrcCompressionCodecNames(codecName)
+  }
+}
+
+object OrcOptions {
+  // The ORC compression short names
+  private val shortOrcCompressionCodecNames = Map(
+"none" -> "NONE",
+"uncompressed" -> "NONE",
+"snappy" -> "SNAPPY",
+"zlib" -> "ZLIB",
+  

spark git commit: [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark

2017-11-02 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master b2463fad7 -> 41b60125b


[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark

## What changes were proposed in this pull request?

This PR proposes to add a link from `spark.catalog(..)` to `Catalog` and expose 
Catalog APIs in PySpark as below:

Screenshot: https://user-images.githubusercontent.com/6477701/32135863-f8e9b040-bc40-11e7-92ad-09c8043a1295.png

Screenshot: https://user-images.githubusercontent.com/6477701/32135849-bb257b86-bc40-11e7-9eda-4d58fc1301c2.png

Note that this is not shown in the list at the top - 
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql

Screenshot: https://user-images.githubusercontent.com/6477701/32135854-d50fab16-bc40-11e7-9181-812c56fd22f5.png

This is basically similar to `DataFrameReader` and `DataFrameWriter`.

## How was this patch tested?

Manually built the doc.

Author: hyukjinkwon 

Closes #19596 from HyukjinKwon/SPARK-22369.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41b60125
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41b60125
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41b60125

Branch: refs/heads/master
Commit: 41b60125b673bad0c133cd5c825d353ac2e6dfd6
Parents: b2463fa
Author: hyukjinkwon 
Authored: Thu Nov 2 15:22:52 2017 +0100
Committer: Reynold Xin 
Committed: Thu Nov 2 15:22:52 2017 +0100

--
 python/pyspark/sql/__init__.py | 3 ++-
 python/pyspark/sql/session.py  | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/41b60125/python/pyspark/sql/__init__.py
--
diff --git a/python/pyspark/sql/__init__.py b/python/pyspark/sql/__init__.py
index 22ec416..c3c06c8 100644
--- a/python/pyspark/sql/__init__.py
+++ b/python/pyspark/sql/__init__.py
@@ -46,6 +46,7 @@ from pyspark.sql.types import Row
 from pyspark.sql.context import SQLContext, HiveContext, UDFRegistration
 from pyspark.sql.session import SparkSession
 from pyspark.sql.column import Column
+from pyspark.sql.catalog import Catalog
 from pyspark.sql.dataframe import DataFrame, DataFrameNaFunctions, 
DataFrameStatFunctions
 from pyspark.sql.group import GroupedData
 from pyspark.sql.readwriter import DataFrameReader, DataFrameWriter
@@ -54,7 +55,7 @@ from pyspark.sql.window import Window, WindowSpec
 
 __all__ = [
 'SparkSession', 'SQLContext', 'HiveContext', 'UDFRegistration',
-'DataFrame', 'GroupedData', 'Column', 'Row',
+'DataFrame', 'GroupedData', 'Column', 'Catalog', 'Row',
 'DataFrameNaFunctions', 'DataFrameStatFunctions', 'Window', 'WindowSpec',
 'DataFrameReader', 'DataFrameWriter'
 ]

http://git-wip-us.apache.org/repos/asf/spark/blob/41b60125/python/pyspark/sql/session.py
--
diff --git a/python/pyspark/sql/session.py b/python/pyspark/sql/session.py
index 2cc0e2d..c3dc1a46 100644
--- a/python/pyspark/sql/session.py
+++ b/python/pyspark/sql/session.py
@@ -271,6 +271,8 @@ class SparkSession(object):
 def catalog(self):
 """Interface through which the user may create, drop, alter or query 
underlying
 databases, tables, functions etc.
+
+:return: :class:`Catalog`
 """
 if not hasattr(self, "_catalog"):
 self._catalog = Catalog(self)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22145][MESOS] fix supervise with checkpointing on mesos

2017-11-02 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 277b1924b -> b2463fad7


[SPARK-22145][MESOS] fix supervise with checkpointing on mesos

## What changes were proposed in this pull request?

- Fixes the issue where the frameworkId recovered from checkpointed data 
overwrote the one sent by the dispatcher.
- Keeps the submission driver id as the only index for all data structures in 
the dispatcher, and allocates a distinct task id per driver retry to satisfy 
the Mesos requirements (a rough sketch of this scheme follows below). Check the 
relevant ticket for the details.
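
A hypothetical sketch of the indexing and retry scheme described above 
(placeholder names, not the actual MesosClusterScheduler code): dispatcher state 
stays keyed by the stable submission id, while every retry gets a fresh Mesos 
task id so Mesos never sees a reused one.

// Illustrative only.
case class DriverRecord(submissionId: String, retryCount: Int)

def mesosTaskId(d: DriverRecord): String =
  if (d.retryCount == 0) d.submissionId
  else s"${d.submissionId}-retry-${d.retryCount}"

// e.g. "driver-20170926223339-0001", then "driver-20170926223339-0001-retry-1", ...
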
## How was this patch tested?

Manually tested this with DC/OS 1.10. Launched a streaming job with 
checkpointing to hdfs, made the driver fail several times and observed behavior:
![image](https://user-images.githubusercontent.com/7945591/30940500-f7d2a744-a3e9-11e7-8c56-f2ccbb271e80.png)

![image](https://user-images.githubusercontent.com/7945591/30940550-19bc15de-a3ea-11e7-8a11-f48abfe36720.png)

![image](https://user-images.githubusercontent.com/7945591/30940524-083ea308-a3ea-11e7-83ae-00d3fa17b928.png)

![image](https://user-images.githubusercontent.com/7945591/30940579-2f0fb242-a3ea-11e7-82f9-86179da28b8c.png)

![image](https://user-images.githubusercontent.com/7945591/30940591-3b561b0e-a3ea-11e7-9dbd-e71912bb2ef3.png)

![image](https://user-images.githubusercontent.com/7945591/30940605-49c810ca-a3ea-11e7-8af5-67930851fd38.png)

![image](https://user-images.githubusercontent.com/7945591/30940631-59f4a288-a3ea-11e7-88cb-c3741b72bb13.png)

![image](https://user-images.githubusercontent.com/7945591/30940642-62346c9e-a3ea-11e7-8935-82e494925f67.png)

![image](https://user-images.githubusercontent.com/7945591/30940653-6c46d53c-a3ea-11e7-8dd1-5840d484d28c.png)

Author: Stavros Kontopoulos 

Closes #19374 from skonto/fix_retry.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2463fad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2463fad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2463fad

Branch: refs/heads/master
Commit: b2463fad718d25f564d62c50d587610de3d0c5bd
Parents: 277b192
Author: Stavros Kontopoulos 
Authored: Thu Nov 2 13:25:48 2017 +
Committer: Sean Owen 
Committed: Thu Nov 2 13:25:48 2017 +

--
 .../scala/org/apache/spark/SparkContext.scala   |  1 +
 .../cluster/mesos/MesosClusterScheduler.scala   | 90 
 .../org/apache/spark/streaming/Checkpoint.scala |  3 +-
 3 files changed, 57 insertions(+), 37 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b2463fad/core/src/main/scala/org/apache/spark/SparkContext.scala
--
diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala 
b/core/src/main/scala/org/apache/spark/SparkContext.scala
index 6f25d34..c7dd635 100644
--- a/core/src/main/scala/org/apache/spark/SparkContext.scala
+++ b/core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -310,6 +310,7 @@ class SparkContext(config: SparkConf) extends Logging {
* (i.e.
*  in case of local spark app something like 'local-1433865536131'
*  in case of YARN something like 'application_1433865536131_34483'
+   *  in case of MESOS something like 'driver-20170926223339-0001'
* )
*/
   def applicationId: String = _applicationId

http://git-wip-us.apache.org/repos/asf/spark/blob/b2463fad/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
--
diff --git 
a/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
 
b/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
index 8247026..de846c8 100644
--- 
a/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
+++ 
b/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
@@ -134,22 +134,24 @@ private[spark] class MesosClusterScheduler(
   private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)
   private val schedulerState = engineFactory.createEngine("scheduler")
   private val stateLock = new Object()
+  // Keyed by submission id
   private val finishedDrivers =
 new mutable.ArrayBuffer[MesosClusterSubmissionState](retainedDrivers)
   private var frameworkId: String = null
-  // Holds all the launched drivers and current launch state, keyed by driver 
id.
+  // Holds all the launched drivers and current launch state, keyed by 
submission id.
   private val launchedDrivers = new mutable.HashMap[String, 

spark git commit: [SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

2017-11-02 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 849b465bb -> 277b1924b


[SPARK-22408][SQL] RelationalGroupedDataset's distinct pivot value calculation 
launches unnecessary stages

## What changes were proposed in this pull request?

Adding a global limit on top of the distinct values before sorting and 
collecting reduces the overall work when there are many distinct values. We 
also eagerly perform a collect rather than a take because we know there are at 
most (maxValues + 1) rows.
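
A hedged usage sketch: when pivot values are not supplied, Spark computes them 
with the distinct/limit/sort/collect query described above, bounded by the 
spark.sql.pivotMaxValues setting (assumed here to default to 10000); supplying 
the values explicitly skips that job entirely.

import spark.implicits._  // assumes an existing SparkSession named `spark`

val df = Seq((2012, "math", 100), (2012, "cs", 200), (2013, "math", 150))
  .toDF("year", "course", "earnings")

// No explicit values: triggers the distinct-value computation on "course".
df.groupBy("year").pivot("course").sum("earnings").show()

// Explicit values: no extra job to discover them.
df.groupBy("year").pivot("course", Seq("math", "cs")).sum("earnings").show()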

## How was this patch tested?

Existing tests cover sorted order

Author: Patrick Woody 

Closes #19629 from pwoody/SPARK-22408.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/277b1924
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/277b1924
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/277b1924

Branch: refs/heads/master
Commit: 277b1924b46a70ab25414f5670eb784906dbbfdf
Parents: 849b465
Author: Patrick Woody 
Authored: Thu Nov 2 14:19:21 2017 +0100
Committer: Reynold Xin 
Committed: Thu Nov 2 14:19:21 2017 +0100

--
 .../scala/org/apache/spark/sql/RelationalGroupedDataset.scala| 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/277b1924/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
index 21e94fa..3e4edd4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala
@@ -321,10 +321,10 @@ class RelationalGroupedDataset protected[sql](
 // Get the distinct values of the column and sort them so its consistent
 val values = df.select(pivotColumn)
   .distinct()
+  .limit(maxValues + 1)
   .sort(pivotColumn)  // ensure that the output columns are in a 
consistent logical order
-  .rdd
+  .collect()
   .map(_.get(0))
-  .take(maxValues + 1)
   .toSeq
 
 if (values.length > maxValues) {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-22306][SQL][2.2] alter table schema should not erase the bucketing metadata at hive side

2017-11-02 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.2 c311c5e79 -> 4074ed2e1


[SPARK-22306][SQL][2.2] alter table schema should not erase the bucketing 
metadata at hive side

## What changes were proposed in this pull request?

When we alter table schema, we set the new schema to spark `CatalogTable`, 
convert it to hive table, and finally call `hive.alterTable`. This causes a 
problem in Spark 2.2, because hive bucketing metadata is not recognized by 
Spark, which means a Spark `CatalogTable` representing a hive table is always 
non-bucketed, and when we convert it to hive table and call `hive.alterTable`, 
the original hive bucketing metadata will be removed.

To fix this bug, we should read out the raw hive table metadata, update its 
schema, and call `hive.alterTable`. By doing this we can guarantee only the 
schema is changed, and nothing else.

Note that this bug doesn't exist in the master branch, because we've added hive 
bucketing support and the hive bucketing metadata can be recognized by Spark. I 
think we should merge this PR to master too, for code cleanup and to reduce the 
difference between master and the 2.2 branch for backporting.

## How was this patch tested?

new regression test
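
A hedged sketch of the kind of regression test described (placeholder helper 
calls such as rawHiveClient, runSqlHive and getBucketColumns, not the actual 
test code): create a bucketed table directly on the Hive side, alter the schema 
through Spark, and assert that the bucketing spec survived.

// Illustrative only.
rawHiveClient.runSqlHive(
  "CREATE TABLE t (a INT, b STRING) CLUSTERED BY (a) INTO 4 BUCKETS STORED AS TEXTFILE")
spark.sql("ALTER TABLE t ADD COLUMNS (c INT)")
assert(rawHiveClient.getBucketColumns("t") == Seq("a"))  // bucketing preserved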

Author: Wenchen Fan 

Closes #19622 from cloud-fan/infer.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4074ed2e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4074ed2e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4074ed2e

Branch: refs/heads/branch-2.2
Commit: 4074ed2e1363c886878bbf9483e21abd1745f482
Parents: c311c5e
Author: Wenchen Fan 
Authored: Thu Nov 2 12:37:52 2017 +0100
Committer: Wenchen Fan 
Committed: Thu Nov 2 12:37:52 2017 +0100

--
 .../sql/catalyst/catalog/ExternalCatalog.scala  | 12 ++---
 .../sql/catalyst/catalog/InMemoryCatalog.scala  |  7 +--
 .../sql/catalyst/catalog/SessionCatalog.scala   | 25 -
 .../catalyst/catalog/ExternalCatalogSuite.scala | 11 ++--
 .../catalyst/catalog/SessionCatalogSuite.scala  | 21 ++--
 .../spark/sql/execution/command/tables.scala| 10 +---
 .../spark/sql/hive/HiveExternalCatalog.scala| 57 +---
 .../spark/sql/hive/HiveMetastoreCatalog.scala   | 11 ++--
 .../spark/sql/hive/client/HiveClient.scala  | 11 
 .../spark/sql/hive/client/HiveClientImpl.scala  | 45 ++--
 .../sql/hive/HiveExternalCatalogSuite.scala | 18 +++
 .../sql/hive/MetastoreDataSourcesSuite.scala|  4 +-
 .../sql/hive/execution/Hive_2_1_DDLSuite.scala  |  2 +-
 13 files changed, 148 insertions(+), 86 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4074ed2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
index 18644b0..8db6f79 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
@@ -148,17 +148,15 @@ abstract class ExternalCatalog
   def alterTable(tableDefinition: CatalogTable): Unit
 
   /**
-   * Alter the schema of a table identified by the provided database and table 
name. The new schema
-   * should still contain the existing bucket columns and partition columns 
used by the table. This
-   * method will also update any Spark SQL-related parameters stored as Hive 
table properties (such
-   * as the schema itself).
+   * Alter the data schema of a table identified by the provided database and 
table name. The new
+   * data schema should not have conflict column names with the existing 
partition columns, and
+   * should still contain all the existing data columns.
*
* @param db Database that table to alter schema for exists in
* @param table Name of table to alter schema for
-   * @param schema Updated schema to be used for the table (must contain 
existing partition and
-   *   bucket columns)
+   * @param newDataSchema Updated data schema to be used for the table.
*/
-  def alterTableSchema(db: String, table: String, schema: StructType): Unit
+  def alterTableDataSchema(db: String, table: String, newDataSchema: 
StructType): Unit
 
   def getTable(db: String, table: String): CatalogTable
 

http://git-wip-us.apache.org/repos/asf/spark/blob/4074ed2e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala
--
diff --git 

spark git commit: [SPARK-14650][REPL][BUILD] Compile Spark REPL for Scala 2.12

2017-11-02 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b04eefae4 -> 849b465bb


[SPARK-14650][REPL][BUILD] Compile Spark REPL for Scala 2.12

## What changes were proposed in this pull request?

Spark REPL changes for Scala 2.12.4: use command(), not processLine() in ILoop; 
remove direct dependence on the older jline. Not sure whether this became needed 
in 2.12.4 or was just missed before. This makes spark-shell work in 2.12.

## How was this patch tested?

Existing tests; manual run of spark-shell in 2.11, 2.12 builds

Author: Sean Owen 

Closes #19612 from srowen/SPARK-14650.2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/849b465b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/849b465b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/849b465b

Branch: refs/heads/master
Commit: 849b465bbf472d6ca56308fb3ccade86e2244e01
Parents: b04eefa
Author: Sean Owen 
Authored: Thu Nov 2 09:45:34 2017 +
Committer: Sean Owen 
Committed: Thu Nov 2 09:45:34 2017 +

--
 pom.xml  | 15 ++-
 repl/pom.xml |  4 
 .../scala/org/apache/spark/repl/SparkILoop.scala | 13 +
 3 files changed, 15 insertions(+), 17 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/849b465b/pom.xml
--
diff --git a/pom.xml b/pom.xml
index 652aed4..8570338 100644
--- a/pom.xml
+++ b/pom.xml
@@ -735,6 +735,12 @@
 scalap
 ${scala.version}
   
+  
+  
+jline
+jline
+2.12.1
+  
   
 org.scalatest
 scalatest_${scala.binary.version}
@@ -1188,6 +1194,10 @@
 org.jboss.netty
 netty
   
+  
+jline
+jline
+  
 
   
   
@@ -1926,11 +1936,6 @@
 ${antlr4.version}
   
   
-jline
-jline
-2.12.1
-  
-  
 org.apache.commons
 commons-crypto
 ${commons-crypto.version}

http://git-wip-us.apache.org/repos/asf/spark/blob/849b465b/repl/pom.xml
--
diff --git a/repl/pom.xml b/repl/pom.xml
index bd2cfc4..1cb0098 100644
--- a/repl/pom.xml
+++ b/repl/pom.xml
@@ -70,10 +70,6 @@
   scala-reflect
   ${scala.version}
 
-
-  jline
-  jline
-
  
   org.slf4j
   jul-to-slf4j

http://git-wip-us.apache.org/repos/asf/spark/blob/849b465b/repl/scala-2.12/src/main/scala/org/apache/spark/repl/SparkILoop.scala
--
diff --git 
a/repl/scala-2.12/src/main/scala/org/apache/spark/repl/SparkILoop.scala 
b/repl/scala-2.12/src/main/scala/org/apache/spark/repl/SparkILoop.scala
index 4135940..900edd6 100644
--- a/repl/scala-2.12/src/main/scala/org/apache/spark/repl/SparkILoop.scala
+++ b/repl/scala-2.12/src/main/scala/org/apache/spark/repl/SparkILoop.scala
@@ -19,9 +19,6 @@ package org.apache.spark.repl
 
 import java.io.BufferedReader
 
-// scalastyle:off println
-import scala.Predef.{println => _, _}
-// scalastyle:on println
 import scala.tools.nsc.Settings
 import scala.tools.nsc.interpreter.{ILoop, JPrintWriter}
 import scala.tools.nsc.util.stringFromStream
@@ -37,7 +34,7 @@ class SparkILoop(in0: Option[BufferedReader], out: 
JPrintWriter)
 
   def initializeSpark() {
 intp.beQuietDuring {
-  processLine("""
+  command("""
 @transient val spark = if (org.apache.spark.repl.Main.sparkSession != 
null) {
 org.apache.spark.repl.Main.sparkSession
   } else {
@@ -64,10 +61,10 @@ class SparkILoop(in0: Option[BufferedReader], out: 
JPrintWriter)
   _sc
 }
 """)
-  processLine("import org.apache.spark.SparkContext._")
-  processLine("import spark.implicits._")
-  processLine("import spark.sql")
-  processLine("import org.apache.spark.sql.functions._")
+  command("import org.apache.spark.SparkContext._")
+  command("import spark.implicits._")
+  command("import spark.sql")
+  command("import org.apache.spark.sql.functions._")
 }
   }
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org