spark git commit: [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last

2016-07-25 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 468a3c3ac -> 68b4020d0


[SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last

## What changes were proposed in this pull request?

Default `TreeNode.withNewChildren` implementation doesn't work for `Last` when 
both constructor arguments are the same, e.g.:

```sql
LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
LAST_VALUE(FALSE, FALSE)
LAST_VALUE(TRUE, TRUE)
```

This is because although `Last` is a unary expression, both of its constructor 
arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the 
same value, `TreeNode.withNewChildren` treats both of them as child nodes by 
mistake. `First` is also affected by this issue in exactly the same way.

This PR fixes this issue by making `ignoreNullsExpr` a child expression of 
`First` and `Last`.
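
To make the failure mode concrete, here is a self-contained sketch in plain Scala.
This is not Spark's actual `TreeNode` code; `Expr`, `Lit`, and `LastLike` are made-up
stand-ins that only mimic the relevant shape: a rewrite that swaps every constructor
argument equal to an old child for the next new child.

```scala
// Standalone illustration only -- the real logic lives in TreeNode.withNewChildren.
object WithNewChildrenSketch {
  sealed trait Expr { def children: Seq[Expr] }
  case class Lit(value: Any) extends Expr { def children: Seq[Expr] = Nil }

  // Pre-fix Last: `ignoreNullsExpr` is a constructor argument but NOT a child.
  case class LastLike(child: Expr, ignoreNullsExpr: Expr) extends Expr {
    def children: Seq[Expr] = child :: Nil
  }

  // Generic rewrite in the spirit of the default withNewChildren: every constructor
  // argument that equals one of the old children is swapped for the next new child.
  def withNewChildren(e: LastLike, newChildren: Seq[Expr]): LastLike = {
    val remaining = newChildren.toBuffer
    val oldChildren = e.children.toSet
    def pick(arg: Expr): Expr =
      if (oldChildren.contains(arg)) remaining.remove(0) else arg
    LastLike(pick(e.child), pick(e.ignoreNullsExpr))
  }

  def main(args: Array[String]): Unit = {
    // LAST_VALUE(FALSE, FALSE): child == ignoreNullsExpr, but children.size == 1,
    // so a transformation supplies exactly one replacement child...
    val last = LastLike(Lit(false), Lit(false))
    try println(withNewChildren(last, Seq(Lit(true))))
    catch {
      // ...and the second swap fails because both arguments matched the old child.
      case e: IndexOutOfBoundsException => println(s"rewrite failed: $e")
    }
  }
}
```

With the fix, `children` also contains `ignoreNullsExpr`, so two replacements are
supplied and each constructor argument receives its own.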

## How was this patch tested?

New test case added in `WindowQuerySuite`.

Author: Cheng Lian 

Closes #14295 from liancheng/spark-16648-last-value.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/68b4020d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/68b4020d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/68b4020d

Branch: refs/heads/master
Commit: 68b4020d0c0d4f063facfbf4639ef4251dcfda8b
Parents: 468a3c3
Author: Cheng Lian 
Authored: Mon Jul 25 17:22:29 2016 +0800
Committer: Wenchen Fan 
Committed: Mon Jul 25 17:22:29 2016 +0800

--
 .../sql/catalyst/expressions/aggregate/First.scala  |  4 ++--
 .../spark/sql/catalyst/expressions/aggregate/Last.scala |  4 ++--
 .../spark/sql/hive/execution/WindowQuerySuite.scala | 12 
 3 files changed, 16 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/68b4020d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
index 946b3d4..d702c08 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
@@ -43,7 +43,7 @@ case class First(child: Expression, ignoreNullsExpr: 
Expression) extends Declara
   throw new AnalysisException("The second argument of First should be a 
boolean literal.")
   }
 
-  override def children: Seq[Expression] = child :: Nil
+  override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil
 
   override def nullable: Boolean = true
 
@@ -54,7 +54,7 @@ case class First(child: Expression, ignoreNullsExpr: 
Expression) extends Declara
   override def dataType: DataType = child.dataType
 
   // Expected input data type.
-  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, 
BooleanType)
 
   private lazy val first = AttributeReference("first", child.dataType)()
 

http://git-wip-us.apache.org/repos/asf/spark/blob/68b4020d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
index 53b4b76..af88403 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
@@ -40,7 +40,7 @@ case class Last(child: Expression, ignoreNullsExpr: 
Expression) extends Declarat
   throw new AnalysisException("The second argument of First should be a 
boolean literal.")
   }
 
-  override def children: Seq[Expression] = child :: Nil
+  override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil
 
   override def nullable: Boolean = true
 
@@ -51,7 +51,7 @@ case class Last(child: Expression, ignoreNullsExpr: 
Expression) extends Declarat
   override def dataType: DataType = child.dataType
 
   // Expected input data type.
-  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, 
BooleanType)
 
   private lazy val last = AttributeReference("last", child.dataType)()
 

http://git-wip-us.apache.org/repos/asf/spark/blob/68b4020d/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/WindowQuerySuite.scala
--

spark git commit: [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last

2016-07-25 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 d226dce12 -> fcbb7f653


[SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last

## What changes were proposed in this pull request?

Default `TreeNode.withNewChildren` implementation doesn't work for `Last` when 
both constructor arguments are the same, e.g.:

```sql
LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE
LAST_VALUE(FALSE, FALSE)
LAST_VALUE(TRUE, TRUE)
```

This is because although `Last` is a unary expression, both of its constructor 
arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the 
same value, `TreeNode.withNewChildren` treats both of them as child nodes by 
mistake. `First` is also affected by this issue in exactly the same way.

This PR fixes this issue by making `ignoreNullsExpr` a child expression of 
`First` and `Last`.

## How was this patch tested?

New test case added in `WindowQuerySuite`.

Author: Cheng Lian 

Closes #14295 from liancheng/spark-16648-last-value.

(cherry picked from commit 68b4020d0c0d4f063facfbf4639ef4251dcfda8b)
Signed-off-by: Wenchen Fan 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fcbb7f65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fcbb7f65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fcbb7f65

Branch: refs/heads/branch-2.0
Commit: fcbb7f653df11d923a208c5af03c0a6b9a472376
Parents: d226dce
Author: Cheng Lian 
Authored: Mon Jul 25 17:22:29 2016 +0800
Committer: Wenchen Fan 
Committed: Mon Jul 25 17:25:19 2016 +0800

--
 .../sql/catalyst/expressions/aggregate/First.scala  |  4 ++--
 .../spark/sql/catalyst/expressions/aggregate/Last.scala |  4 ++--
 .../spark/sql/hive/execution/WindowQuerySuite.scala | 12 
 3 files changed, 16 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fcbb7f65/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
index 946b3d4..d702c08 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala
@@ -43,7 +43,7 @@ case class First(child: Expression, ignoreNullsExpr: 
Expression) extends Declara
   throw new AnalysisException("The second argument of First should be a 
boolean literal.")
   }
 
-  override def children: Seq[Expression] = child :: Nil
+  override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil
 
   override def nullable: Boolean = true
 
@@ -54,7 +54,7 @@ case class First(child: Expression, ignoreNullsExpr: 
Expression) extends Declara
   override def dataType: DataType = child.dataType
 
   // Expected input data type.
-  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, 
BooleanType)
 
   private lazy val first = AttributeReference("first", child.dataType)()
 

http://git-wip-us.apache.org/repos/asf/spark/blob/fcbb7f65/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
index 53b4b76..af88403 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala
@@ -40,7 +40,7 @@ case class Last(child: Expression, ignoreNullsExpr: 
Expression) extends Declarat
   throw new AnalysisException("The second argument of First should be a 
boolean literal.")
   }
 
-  override def children: Seq[Expression] = child :: Nil
+  override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil
 
   override def nullable: Boolean = true
 
@@ -51,7 +51,7 @@ case class Last(child: Expression, ignoreNullsExpr: 
Expression) extends Declarat
   override def dataType: DataType = child.dataType
 
   // Expected input data type.
-  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType)
+  override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, 
BooleanType)
 
   private lazy val last = AttributeReference("last", child.dataType)()
 

http://git-wip-us.apache.org/repos/asf/spark/blob/fcbb7f65/sql/hive/src/test/scala/org/apache/

spark git commit: [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading

2016-07-25 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 68b4020d0 -> 7ffd99ec5


[SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading

## What changes were proposed in this pull request?

Currently, `JDBCRDD.compute` does type dispatch for each row in order to read the 
appropriate values. This might not be necessary, because the schema is already kept 
in `JDBCRDD`.

So, the appropriate converters can be created once according to the schema and then 
applied to each row.
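
As a rough illustration of the idea (a standalone sketch against plain JDBC, not the
actual `JDBCRDD` code; `ColumnType`, `makeSetter`, and `readAll` are hypothetical
names), the type dispatch moves out of the per-row loop and into a one-time pass over
the schema. The patch itself does this with a
`JDBCValueSetter = (ResultSet, MutableRow, Int) => Unit`, as shown in the diff below.

```scala
import java.sql.ResultSet
import scala.collection.mutable.ArrayBuffer

object JdbcConverterSketch {
  type Row = Array[Any]
  // One setter per column: copies column `pos + 1` of the ResultSet into row(pos).
  type ValueSetter = (ResultSet, Row, Int) => Unit

  sealed trait ColumnType
  case object IntCol extends ColumnType
  case object LongCol extends ColumnType
  case object StringCol extends ColumnType

  // Type dispatch happens here, once per column -- not once per record.
  def makeSetter(tpe: ColumnType): ValueSetter = tpe match {
    case IntCol    => (rs, row, pos) => row(pos) = rs.getInt(pos + 1)
    case LongCol   => (rs, row, pos) => row(pos) = rs.getLong(pos + 1)
    case StringCol => (rs, row, pos) => row(pos) = rs.getString(pos + 1)
  }

  def readAll(rs: ResultSet, schema: Seq[ColumnType]): Seq[Row] = {
    val setters = schema.map(makeSetter).toArray   // built once from the schema
    val rows = ArrayBuffer.empty[Row]
    while (rs.next()) {
      val row = new Array[Any](setters.length)
      var i = 0
      while (i < setters.length) { setters(i)(rs, row, i); i += 1 }
      rows += row
    }
    rows.toSeq
  }
}
```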

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon 

Closes #14313 from HyukjinKwon/SPARK-16674.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ffd99ec
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ffd99ec
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ffd99ec

Branch: refs/heads/master
Commit: 7ffd99ec5f267730734431097cbb700ad074bebe
Parents: 68b4020
Author: hyukjinkwon 
Authored: Mon Jul 25 19:57:47 2016 +0800
Committer: Wenchen Fan 
Committed: Mon Jul 25 19:57:47 2016 +0800

--
 .../execution/datasources/jdbc/JDBCRDD.scala| 245 ++-
 1 file changed, 129 insertions(+), 116 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7ffd99ec/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
index 24e2c1a..4c98430 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala
@@ -28,7 +28,7 @@ import org.apache.spark.{Partition, SparkContext, TaskContext}
 import org.apache.spark.internal.Logging
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.expressions.SpecificMutableRow
+import org.apache.spark.sql.catalyst.expressions.{MutableRow, 
SpecificMutableRow}
 import org.apache.spark.sql.catalyst.util.{DateTimeUtils, GenericArrayData}
 import org.apache.spark.sql.jdbc.JdbcDialects
 import org.apache.spark.sql.sources._
@@ -322,43 +322,134 @@ private[sql] class JDBCRDD(
 }
   }
 
-  // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that
-  // we don't have to potentially poke around in the Metadata once for every
-  // row.
-  // Is there a better way to do this?  I'd rather be using a type that
-  // contains only the tags I define.
-  abstract class JDBCConversion
-  case object BooleanConversion extends JDBCConversion
-  case object DateConversion extends JDBCConversion
-  case class  DecimalConversion(precision: Int, scale: Int) extends 
JDBCConversion
-  case object DoubleConversion extends JDBCConversion
-  case object FloatConversion extends JDBCConversion
-  case object IntegerConversion extends JDBCConversion
-  case object LongConversion extends JDBCConversion
-  case object BinaryLongConversion extends JDBCConversion
-  case object StringConversion extends JDBCConversion
-  case object TimestampConversion extends JDBCConversion
-  case object BinaryConversion extends JDBCConversion
-  case class ArrayConversion(elementConversion: JDBCConversion) extends 
JDBCConversion
+  // A `JDBCValueSetter` is responsible for converting and setting a value 
from `ResultSet`
+  // into a field for `MutableRow`. The last argument `Int` means the index 
for the
+  // value to be set in the row and also used for the value to retrieve from 
`ResultSet`.
+  private type JDBCValueSetter = (ResultSet, MutableRow, Int) => Unit
 
   /**
-   * Maps a StructType to a type tag list.
+   * Creates `JDBCValueSetter`s according to [[StructType]], which can set
+   * each value from `ResultSet` to each field of [[MutableRow]] correctly.
*/
-  def getConversions(schema: StructType): Array[JDBCConversion] =
-schema.fields.map(sf => getConversions(sf.dataType, sf.metadata))
-
-  private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion 
= dt match {
-case BooleanType => BooleanConversion
-case DateType => DateConversion
-case DecimalType.Fixed(p, s) => DecimalConversion(p, s)
-case DoubleType => DoubleConversion
-case FloatType => FloatConversion
-case IntegerType => IntegerConversion
-case LongType => if (metadata.contains("binarylong")) BinaryLongConversion 
else LongConversion
-case StringType => StringConversion
-case TimestampType => TimestampConversion
-case BinaryType => BinaryConversion
-case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata))
+  

spark git commit: [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable

2016-07-25 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master 7ffd99ec5 -> d27d362eb


[SPARK-16660][SQL] CreateViewCommand should not take CatalogTable

## What changes were proposed in this pull request?

`CreateViewCommand` only needs some of the information in a `CatalogTable`, not all 
of it. We needed a few tricks (e.g., checking that the table type is `VIEW`, making 
`CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`.
This PR cleans that up and passes only the necessary information to 
`CreateViewCommand`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan 

Closes #14297 from cloud-fan/minor2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d27d362e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d27d362e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d27d362e

Branch: refs/heads/master
Commit: d27d362ebae0c4a5cc6c99f13ef20049214dd4f9
Parents: 7ffd99e
Author: Wenchen Fan 
Authored: Mon Jul 25 22:02:00 2016 +0800
Committer: Cheng Lian 
Committed: Mon Jul 25 22:02:00 2016 +0800

--
 .../spark/sql/catalyst/catalog/interface.scala  |   6 +-
 .../scala/org/apache/spark/sql/Dataset.scala|  27 ++---
 .../spark/sql/execution/SparkSqlParser.scala|  51 -
 .../spark/sql/execution/command/views.scala | 111 ++-
 .../spark/sql/hive/HiveMetastoreCatalog.scala   |   2 -
 .../spark/sql/hive/HiveDDLCommandSuite.scala|  46 +++-
 6 files changed, 116 insertions(+), 127 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d27d362e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
index b7f35b3..2a20651 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
@@ -81,9 +81,9 @@ object CatalogStorageFormat {
  */
 case class CatalogColumn(
 name: String,
-// This may be null when used to create views. TODO: make this type-safe; 
this is left
-// as a string due to issues in converting Hive varchars to and from 
SparkSQL strings.
-@Nullable dataType: String,
+// TODO: make this type-safe; this is left as a string due to issues in 
converting Hive
+// varchars to and from SparkSQL strings.
+dataType: String,
 nullable: Boolean = true,
 comment: Option[String] = None) {
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d27d362e/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index b28ecb7..8b6443c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2421,13 +2421,7 @@ class Dataset[T] private[sql](
*/
   @throws[AnalysisException]
   def createTempView(viewName: String): Unit = withPlan {
-val tableDesc = CatalogTable(
-  identifier = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
-  tableType = CatalogTableType.VIEW,
-  schema = Seq.empty[CatalogColumn],
-  storage = CatalogStorageFormat.empty)
-CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = 
false,
-  isTemporary = true)
+createViewCommand(viewName, replace = false)
   }
 
   /**
@@ -2438,12 +2432,19 @@ class Dataset[T] private[sql](
* @since 2.0.0
*/
   def createOrReplaceTempView(viewName: String): Unit = withPlan {
-val tableDesc = CatalogTable(
-  identifier = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
-  tableType = CatalogTableType.VIEW,
-  schema = Seq.empty[CatalogColumn],
-  storage = CatalogStorageFormat.empty)
-CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = 
true,
+createViewCommand(viewName, replace = true)
+  }
+
+  private def createViewCommand(viewName: String, replace: Boolean): 
CreateViewCommand = {
+CreateViewCommand(
+  name = 
sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName),
+  userSpecifiedColumns = Nil,
+  comment = None,
+  properties = Map.empty,
+  originalText = None,
+  child = logicalPlan,
+  allowExisting = false,
+  replace = replace,
   isTemporary = true)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/d27d362e/sql/core/src/main/scala/org/apache

spark git commit: [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable

2016-07-25 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master d27d362eb -> 64529b186


[SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable

## What changes were proposed in this pull request?

It's weird that we have `BucketSpec` to abstract bucket info but don't use it in 
`CatalogTable`. This PR moves `BucketSpec` into the catalyst module and uses it in 
`CatalogTable`.
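
For illustration, here is a standalone sketch of the shape of the change
(`BucketSpec` mirrors the class shown in the diff below, except that the real one
throws an `AnalysisException` rather than using `require`; `CatalogTableSketch` is a
made-up stand-in for `CatalogTable`): bucketing info becomes one optional value
instead of three loose fields.

```scala
// Standalone sketch -- the real classes live in org.apache.spark.sql.catalyst.catalog.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  require(numBuckets > 0, s"Expected positive number of buckets, but got `$numBuckets`.")
}

// Before this PR, CatalogTable carried numBuckets / bucketColumnNames / sortColumnNames
// as separate fields; after it, a single optional BucketSpec (None = not bucketed).
case class CatalogTableSketch(
    identifier: String,
    bucketSpec: Option[BucketSpec] = None)

object BucketSpecDemo {
  def main(args: Array[String]): Unit = {
    val bucketed = CatalogTableSketch("db.tbl", Some(BucketSpec(8, Seq("id"), Seq("ts"))))
    val plain = CatalogTableSketch("db.other")
    println(bucketed)
    println(plain)
  }
}
```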

## How was this patch tested?

existing tests.

Author: Wenchen Fan 

Closes #14331 from cloud-fan/check.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/64529b18
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/64529b18
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/64529b18

Branch: refs/heads/master
Commit: 64529b186a1c33740067cc7639d630bc5b9ae6e8
Parents: d27d362
Author: Wenchen Fan 
Authored: Mon Jul 25 22:05:48 2016 +0800
Committer: Cheng Lian 
Committed: Mon Jul 25 22:05:48 2016 +0800

--
 .../spark/sql/catalyst/catalog/interface.scala  | 49 
 .../catalyst/catalog/ExternalCatalogSuite.scala |  2 +-
 .../org/apache/spark/sql/DataFrameWriter.scala  |  5 +-
 .../spark/sql/execution/command/ddl.scala   |  3 +-
 .../spark/sql/execution/command/tables.scala| 30 +-
 .../execution/datasources/BucketingUtils.scala  | 39 +
 .../sql/execution/datasources/DataSource.scala  |  1 +
 .../datasources/FileSourceStrategy.scala|  1 +
 .../InsertIntoHadoopFsRelationCommand.scala |  2 +-
 .../execution/datasources/WriterContainer.scala |  1 +
 .../sql/execution/datasources/bucket.scala  | 59 
 .../spark/sql/execution/datasources/ddl.scala   |  1 +
 .../datasources/fileSourceInterfaces.scala  |  2 +-
 .../apache/spark/sql/internal/CatalogImpl.scala |  2 +-
 .../sql/execution/command/DDLCommandSuite.scala |  6 +-
 .../spark/sql/execution/command/DDLSuite.scala  |  3 +-
 .../datasources/FileSourceStrategySuite.scala   |  1 +
 .../spark/sql/internal/CatalogSuite.scala   |  5 +-
 .../sql/sources/CreateTableAsSelectSuite.scala  |  2 +-
 .../spark/sql/hive/client/HiveClientImpl.scala  |  9 +--
 .../spark/sql/hive/HiveDDLCommandSuite.scala|  8 +--
 .../spark/sql/sources/BucketedReadSuite.scala   |  3 +-
 22 files changed, 117 insertions(+), 117 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/64529b18/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
index 2a20651..710bce5 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
@@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.{FunctionIdentifier, 
TableIdentifier}
 import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeReference}
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}
+import org.apache.spark.sql.catalyst.util.quoteIdentifier
 
 
 /**
@@ -110,6 +111,24 @@ case class CatalogTablePartition(
 
 
 /**
+ * A container for bucketing information.
+ * Bucketing is a technology for decomposing data sets into more manageable 
parts, and the number
+ * of buckets is fixed so it does not fluctuate with data.
+ *
+ * @param numBuckets number of buckets.
+ * @param bucketColumnNames the names of the columns that used to generate the 
bucket id.
+ * @param sortColumnNames the names of the columns that used to sort data in 
each bucket.
+ */
+case class BucketSpec(
+numBuckets: Int,
+bucketColumnNames: Seq[String],
+sortColumnNames: Seq[String]) {
+  if (numBuckets <= 0) {
+throw new AnalysisException(s"Expected positive number of buckets, but got 
`$numBuckets`.")
+  }
+}
+
+/**
  * A table defined in the catalog.
  *
  * Note that Hive's metastore also tracks skewed columns. We should consider 
adding that in the
@@ -124,9 +143,7 @@ case class CatalogTable(
 storage: CatalogStorageFormat,
 schema: Seq[CatalogColumn],
 partitionColumnNames: Seq[String] = Seq.empty,
-sortColumnNames: Seq[String] = Seq.empty,
-bucketColumnNames: Seq[String] = Seq.empty,
-numBuckets: Int = -1,
+bucketSpec: Option[BucketSpec] = None,
 owner: String = "",
 createTime: Long = System.currentTimeMillis,
 lastAccessTime: Long = -1,
@@ -143,8 +160,8 @@ case class CatalogTable(
   s"must be a subset of schema (${colNames.mkString(", ")}) in table 
'$identifier'")
   }
   requireSubsetOfSchema(partitionColumnNames, "partition")
-  requireSubsetOfSch

spark git commit: [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages

2016-07-25 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master 64529b186 -> d6a52176a


[SPARK-16668][TEST] Test parquet reader for row groups containing both 
dictionary and plain encoded pages

## What changes were proposed in this pull request?

This patch adds an explicit test for [SPARK-14217] by setting the parquet 
dictionary and page sizes such that the generated parquet file spans 3 pages 
(within a single row group), where the first page is dictionary encoded and the 
remaining two are plain encoded.

## How was this patch tested?

1. ParquetEncodingSuite
2. Also manually tested that this test fails without 
https://github.com/apache/spark/pull/12279

Author: Sameer Agarwal 

Closes #14304 from sameeragarwal/hybrid-encoding-test.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d6a52176
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d6a52176
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d6a52176

Branch: refs/heads/master
Commit: d6a52176ade92853f37167ad27631977dc79bc76
Parents: 64529b1
Author: Sameer Agarwal 
Authored: Mon Jul 25 22:31:01 2016 +0800
Committer: Cheng Lian 
Committed: Mon Jul 25 22:31:01 2016 +0800

--
 .../parquet/ParquetEncodingSuite.scala  | 29 
 1 file changed, 29 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d6a52176/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
index 88fcfce..c754188 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
@@ -16,6 +16,10 @@
  */
 package org.apache.spark.sql.execution.datasources.parquet
 
+import scala.collection.JavaConverters._
+
+import org.apache.parquet.hadoop.ParquetOutputFormat
+
 import org.apache.spark.sql.test.SharedSQLContext
 
 // TODO: this needs a lot more testing but it's currently not easy to test 
with the parquet
@@ -78,4 +82,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest 
with SharedSQLContex
   }}
 }
   }
+
+  test("Read row group containing both dictionary and plain encoded pages") {
+withSQLConf(ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "2048",
+  ParquetOutputFormat.PAGE_SIZE -> "4096") {
+  withTempPath { dir =>
+// In order to explicitly test for SPARK-14217, we set the parquet 
dictionary and page size
+// such that the following data spans across 3 pages (within a single 
row group) where the
+// first page is dictionary encoded and the remaining two are plain 
encoded.
+val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
+data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
+val file = 
SpecificParquetRecordReaderBase.listDirectory(dir).asScala.head
+
+val reader = new VectorizedParquetRecordReader
+reader.initialize(file, null /* set columns to null to project all 
columns */)
+val column = reader.resultBatch().column(0)
+assert(reader.nextBatch())
+
+(0 until 512).foreach { i =>
+  assert(column.getUTF8String(3 * i).toString == i.toString)
+  assert(column.getUTF8String(3 * i + 1).toString == i.toString)
+  assert(column.getUTF8String(3 * i + 2).toString == i.toString)
+}
+  }
+}
+  }
 }





spark git commit: [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat

2016-07-25 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master d6a52176a -> 79826f3c7


[SPARK-16698][SQL] Field names having dots should be allowed for datasources 
based on FileFormat

## What changes were proposed in this pull request?

It seems this is a regression, judging from 
https://issues.apache.org/jira/browse/SPARK-16698.

A field name containing dots throws an exception. For example, the code below:

```scala
val path = "/tmp/path"
val json =""" {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 
https://github.com/apache/spark/commit/17eec0a71ba8713c559d641e3f43a1be726b037c#diff-27c76f96a7b2733ecfd6f46a1716e153R121

When extracting the data columns, it does not account for the fact that field names 
can contain dots. Also, field names are not expected to be quoted when a schema is 
defined, so there is no need to check whether a name is wrapped in quotes: the 
actual schema (whether inferred or user-given) does not carry quotes in its field 
names.

For example, this throws an exception (**loading JSON from an RDD is fine**):

```scala
val json =""" {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input 
columns: [`a.b`];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```
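
The actual fix is the one-line switch from `resolveQuoted` to
`resolve(field.name :: Nil, resolver)`, shown in the diff further down. As a
standalone illustration of why it matters (plain Scala with hypothetical helper
names, not Spark's `LogicalPlan` code): treating the schema's field name as a single
name part lets `a.b` match an attribute literally named `a.b`, whereas parsing it as
a dotted reference splits it apart and the lookup fails.

```scala
object DottedNameSketch {
  case class Attribute(name: String)
  // Case-sensitive resolution, the simplest possible Resolver stand-in.
  val resolver: (String, String) => Boolean = _ == _

  // Parsed flavour: an unquoted "a.b" is split into Seq("a", "b") and can no longer
  // match the single attribute named "a.b" (nested resolution is elided here).
  def resolveParsed(name: String, attrs: Seq[Attribute]): Option[Attribute] = {
    val parts = name.split('.')
    if (parts.length == 1) attrs.find(a => resolver(a.name, parts.head)) else None
  }

  // Literal flavour: the schema field name is used as-is, as one name part.
  def resolveLiteral(name: String, attrs: Seq[Attribute]): Option[Attribute] =
    attrs.find(a => resolver(a.name, name))

  def main(args: Array[String]): Unit = {
    val attrs = Seq(Attribute("a.b"))
    println(resolveParsed("a.b", attrs))   // None -> "Unable to resolve a.b given [a.b]"
    println(resolveLiteral("a.b", attrs))  // Some(Attribute(a.b))
  }
}
```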

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon 

Closes #14339 from HyukjinKwon/SPARK-16698-regression.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79826f3c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79826f3c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79826f3c

Branch: refs/heads/master
Commit: 79826f3c7936ee27457d030c7115d5cac69befd7
Parents: d6a5217
Author: hyukjinkwon 
Authored: Mon Jul 25 22:51:30 2016 +0800
Committer: Cheng Lian 
Committed: Mon Jul 25 22:51:30 2016 +0800

--
 .../sql/catalyst/plans/logical/LogicalPlan.scala |  2 +-
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala   | 15 +++
 2 files changed, 16 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/79826f3c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index d0b2b5d..6d77991 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -127,7 +127,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] 
with Logging {
*/
   def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = {
 schema.map { field =>
-  resolveQuoted(field.name, resolver).map {
+  resolve(field.name :: Nil, resolver).map {
 case a: AttributeReference => a
 case other => sys.error(s"can not handle nested schema yet...  plan 
$this")
   }.getOrElse {

http://git-wip-us.apache.org/repos/asf/spark/blob/79826f3c/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index aa80d61..06cc2a5 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2982,4 +2982,19 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 """.stripMargin), Nil)
 }
   }
+
+  test("SPARK-16674: field names containing dots for both fields and 
partitioned fields") {
+withTempPath { path =>
+  val data = (1 to 10).map(i => (i, s"data-$i", i % 2, 

spark git commit: [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat

2016-07-25 Thread lian
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 fcbb7f653 -> b52e639a8


[SPARK-16698][SQL] Field names having dots should be allowed for datasources 
based on FileFormat

## What changes were proposed in this pull request?

It seems this is a regression, judging from 
https://issues.apache.org/jira/browse/SPARK-16698.

A field name containing dots throws an exception. For example, the code below:

```scala
val path = "/tmp/path"
val json =""" {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in 
https://github.com/apache/spark/commit/17eec0a71ba8713c559d641e3f43a1be726b037c#diff-27c76f96a7b2733ecfd6f46a1716e153R121

When extracting the data columns, it does not account for the fact that field names 
can contain dots. Also, field names are not expected to be quoted when a schema is 
defined, so there is no need to check whether a name is wrapped in quotes: the 
actual schema (whether inferred or user-given) does not carry quotes in its field 
names.

For example, this throws an exception (**loading JSON from an RDD is fine**):

```scala
val json =""" {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input 
columns: [`a.b`];
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?

Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon 

Closes #14339 from HyukjinKwon/SPARK-16698-regression.

(cherry picked from commit 79826f3c7936ee27457d030c7115d5cac69befd7)
Signed-off-by: Cheng Lian 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b52e639a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b52e639a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b52e639a

Branch: refs/heads/branch-2.0
Commit: b52e639a84a851e0b9159a0f6dae92664425042e
Parents: fcbb7f6
Author: hyukjinkwon 
Authored: Mon Jul 25 22:51:30 2016 +0800
Committer: Cheng Lian 
Committed: Mon Jul 25 22:51:56 2016 +0800

--
 .../sql/catalyst/plans/logical/LogicalPlan.scala |  2 +-
 .../scala/org/apache/spark/sql/SQLQuerySuite.scala   | 15 +++
 2 files changed, 16 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b52e639a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
index d0b2b5d..6d77991 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
@@ -127,7 +127,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] 
with Logging {
*/
   def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = {
 schema.map { field =>
-  resolveQuoted(field.name, resolver).map {
+  resolve(field.name :: Nil, resolver).map {
 case a: AttributeReference => a
 case other => sys.error(s"can not handle nested schema yet...  plan 
$this")
   }.getOrElse {

http://git-wip-us.apache.org/repos/asf/spark/blob/b52e639a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index f1a2410..be84dff 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -2946,4 +2946,19 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 """.stripMargin), Nil)
 }
   }
+
+  test("SPARK-16674: field names containing dots for both fields and 
partit

spark git commit: [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 79826f3c7 -> 7ea6d282b


[SPARK-16703][SQL] Remove extra whitespace in SQL generation for window 
functions

## What changes were proposed in this pull request?

This PR fixes a minor formatting issue in the SQL generated by 
`WindowSpecDefinition.sql` when no partitioning expressions are present.

Before:

```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

After:

```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

## How was this patch tested?

New test case added in `ExpressionSQLBuilderSuite`.

Author: Cheng Lian 

Closes #14334 from liancheng/window-spec-sql-format.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ea6d282
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ea6d282
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ea6d282

Branch: refs/heads/master
Commit: 7ea6d282b925819ddb3874a67b3c9da8cc41f131
Parents: 79826f3
Author: Cheng Lian 
Authored: Mon Jul 25 09:42:39 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 09:42:39 2016 -0700

--
 .../expressions/windowExpressions.scala |  6 ++--
 .../sqlgen/aggregate_functions_and_window.sql   |  2 +-
 .../sqlgen/regular_expressions_and_window.sql   |  2 +-
 .../test/resources/sqlgen/window_basic_1.sql|  2 +-
 .../test/resources/sqlgen/window_basic_2.sql|  2 +-
 .../catalyst/ExpressionSQLBuilderSuite.scala| 35 ++--
 6 files changed, 40 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7ea6d282/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index c0b453d..e35192c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -82,16 +82,16 @@ case class WindowSpecDefinition(
 val partition = if (partitionSpec.isEmpty) {
   ""
 } else {
-  "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ")
+  "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ") + " "
 }
 
 val order = if (orderSpec.isEmpty) {
   ""
 } else {
-  "ORDER BY " + orderSpec.map(_.sql).mkString(", ")
+  "ORDER BY " + orderSpec.map(_.sql).mkString(", ") + " "
 }
 
-s"($partition $order ${frameSpecification.toString})"
+s"($partition$order${frameSpecification.toString})"
   }
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/7ea6d282/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
--
diff --git 
a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql 
b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
index 3a29bcf..c94f53b 100644
--- a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
+++ b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
@@ -1,4 +1,4 @@
 -- This file is automatically generated by LogicalPlanToSQLSuite.
 SELECT MAX(c) + COUNT(a) OVER () FROM parquet_t2 GROUP BY a, b
 

-SELECT `gen_attr` AS `(max(c) + count(a) OVER (  ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS 
`gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, 
count(`gen_attr`) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` 
FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS 
`gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, 
`gen_attr`) AS gen_subquery_1) AS gen_subquery_2) AS gen_subquery_3
+SELECT `gen_attr` AS `(max(c) + count(a) OVER (ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS 
`gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, 
count(`gen_attr`) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` 
FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS 
`gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, 
`gen_attr`) AS gen_subquery_1) AS gen_subquery_2) AS gen_subquery_3

http://git-wip-us.apache.org/repos/asf/s

spark git commit: [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b52e639a8 -> 57d65e511


[SPARK-16703][SQL] Remove extra whitespace in SQL generation for window 
functions

## What changes were proposed in this pull request?

This PR fixes a minor formatting issue in the SQL generated by 
`WindowSpecDefinition.sql` when no partitioning expressions are present.

Before:

```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

After:

```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

## How was this patch tested?

New test case added in `ExpressionSQLBuilderSuite`.

Author: Cheng Lian 

Closes #14334 from liancheng/window-spec-sql-format.

(cherry picked from commit 7ea6d282b925819ddb3874a67b3c9da8cc41f131)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57d65e51
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57d65e51
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57d65e51

Branch: refs/heads/branch-2.0
Commit: 57d65e5111e281d3d5224c5ea11005c89718f791
Parents: b52e639
Author: Cheng Lian 
Authored: Mon Jul 25 09:42:39 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 09:42:47 2016 -0700

--
 .../expressions/windowExpressions.scala |  6 ++--
 .../sqlgen/aggregate_functions_and_window.sql   |  2 +-
 .../sqlgen/regular_expressions_and_window.sql   |  2 +-
 .../test/resources/sqlgen/window_basic_1.sql|  2 +-
 .../test/resources/sqlgen/window_basic_2.sql|  2 +-
 .../catalyst/ExpressionSQLBuilderSuite.scala| 35 ++--
 6 files changed, 40 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/57d65e51/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index c0b453d..e35192c 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -82,16 +82,16 @@ case class WindowSpecDefinition(
 val partition = if (partitionSpec.isEmpty) {
   ""
 } else {
-  "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ")
+  "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ") + " "
 }
 
 val order = if (orderSpec.isEmpty) {
   ""
 } else {
-  "ORDER BY " + orderSpec.map(_.sql).mkString(", ")
+  "ORDER BY " + orderSpec.map(_.sql).mkString(", ") + " "
 }
 
-s"($partition $order ${frameSpecification.toString})"
+s"($partition$order${frameSpecification.toString})"
   }
 }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/57d65e51/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
--
diff --git 
a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql 
b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
index 3a29bcf..c94f53b 100644
--- a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
+++ b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql
@@ -1,4 +1,4 @@
 -- This file is automatically generated by LogicalPlanToSQLSuite.
 SELECT MAX(c) + COUNT(a) OVER () FROM parquet_t2 GROUP BY a, b
 

-SELECT `gen_attr` AS `(max(c) + count(a) OVER (  ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS 
`gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, 
count(`gen_attr`) OVER (  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` 
FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS 
`gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, 
`gen_attr`) AS gen_subquery_1) AS gen_subquery_2) AS gen_subquery_3
+SELECT `gen_attr` AS `(max(c) + count(a) OVER (ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS 
`gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, 
count(`gen_attr`) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` 
FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS 
`gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, 
`ge

spark git commit: [SPARKR][DOCS] fix broken url in doc

2016-07-25 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 7ea6d282b -> b73defdd7


[SPARKR][DOCS] fix broken url in doc

## What changes were proposed in this pull request?

Fix a broken URL. Also:

The sparkR.session.stop doc page should have sparkR.session.stop in the header, 
instead of saying "sparkR.stop":
![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)

The data type section is in the middle of the list of gapply/gapplyCollect 
subsections:
![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)

## How was this patch tested?

manual test

Author: Felix Cheung 

Closes #14329 from felixcheung/rdoclinkfix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b73defdd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b73defdd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b73defdd

Branch: refs/heads/master
Commit: b73defdd790cb823a4f9958ca89cec06fd198051
Parents: 7ea6d28
Author: Felix Cheung 
Authored: Mon Jul 25 11:25:41 2016 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jul 25 11:25:41 2016 -0700

--
 R/pkg/R/DataFrame.R |   2 +-
 R/pkg/R/sparkR.R|  16 +++
 docs/sparkr.md  | 107 +++
 3 files changed, 62 insertions(+), 63 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b73defdd/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 2e99aa0..a473331 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -35,7 +35,7 @@ setOldClass("structType")
 #' @slot env An R environment that stores bookkeeping states of the 
SparkDataFrame
 #' @slot sdf A Java object reference to the backing Scala DataFrame
 #' @seealso \link{createDataFrame}, \link{read.json}, \link{table}
-#' @seealso 
\url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe}
+#' @seealso 
\url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes}
 #' @export
 #' @examples
 #'\dontrun{

http://git-wip-us.apache.org/repos/asf/spark/blob/b73defdd/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index ff5297f..524f7c4 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -28,14 +28,6 @@ connExists <- function(env) {
   })
 }
 
-#' @rdname sparkR.session.stop
-#' @name sparkR.stop
-#' @export
-#' @note sparkR.stop since 1.4.0
-sparkR.stop <- function() {
-  sparkR.session.stop()
-}
-
 #' Stop the Spark Session and Spark Context
 #'
 #' Stop the Spark Session and Spark Context.
@@ -90,6 +82,14 @@ sparkR.session.stop <- function() {
   clearJobjs()
 }
 
+#' @rdname sparkR.session.stop
+#' @name sparkR.stop
+#' @export
+#' @note sparkR.stop since 1.4.0
+sparkR.stop <- function() {
+  sparkR.session.stop()
+}
+
 #' (Deprecated) Initialize a new Spark Context
 #'
 #' This function initializes a new SparkContext.

http://git-wip-us.apache.org/repos/asf/spark/blob/b73defdd/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index dfa5278..4bbc362 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -322,8 +322,59 @@ head(ldf, 3)
 Apply a function to each group of a `SparkDataFrame`. The function is to be 
applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
 that key. The groups are chosen from `SparkDataFrame`s column(s).
 The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
-`SparkDataFrame`. It must represent R function's output schema on the basis of 
Spark data types. The column names of the returned `data.frame` are set by 
user. Below is the data type mapping between R
-and Spark.
+`SparkDataFrame`. It must represent R function's output schema on the basis of 
Spark [data types](#data-type-mapping-between-r-and-spark). The column names of 
the returned `data.frame` are set by user.
+
+
+{% highlight r %}
+
+# Determine six waiting times with the largest eruption time in minutes.
+schema <- structType(structField("waiting", "double"), 
structField("max_eruption", "double"))
+result <- gapply(
+df,
+"waiting",
+function(key, x) {
+y <- data.frame(key, max(x$eruptions))
+},
+schema)
+head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
+
+##waiting   max_eruption
+##1  64   5.100
+##2  69   5.067
+##3  71   5.033
+##4  87   5.000
+##5  63   4.933
+##6  89   4.900
+{% endhighlight %}
+
+
+# gapplyCollect
+Like `gapply`, applies a function to each partition of 

spark git commit: [SPARKR][DOCS] fix broken url in doc

2016-07-25 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 57d65e511 -> d9bd066b9


[SPARKR][DOCS] fix broken url in doc

## What changes were proposed in this pull request?

Fix a broken URL. Also:

The sparkR.session.stop doc page should have sparkR.session.stop in the header, 
instead of saying "sparkR.stop":
![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)

The data type section is in the middle of the list of gapply/gapplyCollect 
subsections:
![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)

## How was this patch tested?

manual test

Author: Felix Cheung 

Closes #14329 from felixcheung/rdoclinkfix.

(cherry picked from commit b73defdd790cb823a4f9958ca89cec06fd198051)
Signed-off-by: Shivaram Venkataraman 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d9bd066b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d9bd066b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d9bd066b

Branch: refs/heads/branch-2.0
Commit: d9bd066b9f37cfd18037b9a600371d0342703c0f
Parents: 57d65e5
Author: Felix Cheung 
Authored: Mon Jul 25 11:25:41 2016 -0700
Committer: Shivaram Venkataraman 
Committed: Mon Jul 25 11:25:51 2016 -0700

--
 R/pkg/R/DataFrame.R |   2 +-
 R/pkg/R/sparkR.R|  16 +++
 docs/sparkr.md  | 107 +++
 3 files changed, 62 insertions(+), 63 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/d9bd066b/R/pkg/R/DataFrame.R
--
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 92c10f1..aa211b3 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -35,7 +35,7 @@ setOldClass("structType")
 #' @slot env An R environment that stores bookkeeping states of the 
SparkDataFrame
 #' @slot sdf A Java object reference to the backing Scala DataFrame
 #' @seealso \link{createDataFrame}, \link{read.json}, \link{table}
-#' @seealso 
\url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe}
+#' @seealso 
\url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes}
 #' @export
 #' @examples
 #'\dontrun{

http://git-wip-us.apache.org/repos/asf/spark/blob/d9bd066b/R/pkg/R/sparkR.R
--
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index ff5297f..524f7c4 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -28,14 +28,6 @@ connExists <- function(env) {
   })
 }
 
-#' @rdname sparkR.session.stop
-#' @name sparkR.stop
-#' @export
-#' @note sparkR.stop since 1.4.0
-sparkR.stop <- function() {
-  sparkR.session.stop()
-}
-
 #' Stop the Spark Session and Spark Context
 #'
 #' Stop the Spark Session and Spark Context.
@@ -90,6 +82,14 @@ sparkR.session.stop <- function() {
   clearJobjs()
 }
 
+#' @rdname sparkR.session.stop
+#' @name sparkR.stop
+#' @export
+#' @note sparkR.stop since 1.4.0
+sparkR.stop <- function() {
+  sparkR.session.stop()
+}
+
 #' (Deprecated) Initialize a new Spark Context
 #'
 #' This function initializes a new SparkContext.

http://git-wip-us.apache.org/repos/asf/spark/blob/d9bd066b/docs/sparkr.md
--
diff --git a/docs/sparkr.md b/docs/sparkr.md
index dfa5278..4bbc362 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -322,8 +322,59 @@ head(ldf, 3)
 Apply a function to each group of a `SparkDataFrame`. The function is to be 
applied to each group of the `SparkDataFrame` and should have only two 
parameters: grouping key and R `data.frame` corresponding to
 that key. The groups are chosen from `SparkDataFrame`s column(s).
 The output of function should be a `data.frame`. Schema specifies the row 
format of the resulting
-`SparkDataFrame`. It must represent R function's output schema on the basis of 
Spark data types. The column names of the returned `data.frame` are set by 
user. Below is the data type mapping between R
-and Spark.
+`SparkDataFrame`. It must represent R function's output schema on the basis of 
Spark [data types](#data-type-mapping-between-r-and-spark). The column names of 
the returned `data.frame` are set by user.
+
+
+{% highlight r %}
+
+# Determine six waiting times with the largest eruption time in minutes.
+schema <- structType(structField("waiting", "double"), 
structField("max_eruption", "double"))
+result <- gapply(
+df,
+"waiting",
+function(key, x) {
+y <- data.frame(key, max(x$eruptions))
+},
+schema)
+head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
+
+##waiting   max_eruption
+##1  64   5.100
+##2  69   5.067
+##3  71   5.033
+##4  87   5.000
+##5  63   4.933
+##6  

spark git commit: [SPARK-16685] Remove audit-release scripts.

2016-07-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master ad3708e78 -> dd784a882


[SPARK-16685] Remove audit-release scripts.

## What changes were proposed in this pull request?
This patch removes dev/audit-release. It was initially created to do basic 
release auditing, but the scripts have been unused for more than a year.

## How was this patch tested?
N/A

Author: Reynold Xin 

Closes #14342 from rxin/SPARK-16685.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd784a88
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd784a88
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd784a88

Branch: refs/heads/master
Commit: dd784a8822497ad0631208d56325c4d74ab9e036
Parents: ad3708e
Author: Reynold Xin 
Authored: Mon Jul 25 20:03:54 2016 +0100
Committer: Sean Owen 
Committed: Mon Jul 25 20:03:54 2016 +0100

--
 dev/audit-release/.gitignore|   2 -
 dev/audit-release/README.md |  12 -
 dev/audit-release/audit_release.py  | 236 ---
 dev/audit-release/blank_maven_build/pom.xml |  43 
 dev/audit-release/blank_sbt_build/build.sbt |  30 ---
 dev/audit-release/maven_app_core/input.txt  |   8 -
 dev/audit-release/maven_app_core/pom.xml|  52 
 .../maven_app_core/src/main/java/SimpleApp.java |  42 
 dev/audit-release/sbt_app_core/build.sbt|  28 ---
 dev/audit-release/sbt_app_core/input.txt|   8 -
 .../sbt_app_core/src/main/scala/SparkApp.scala  |  63 -
 dev/audit-release/sbt_app_ganglia/build.sbt |  30 ---
 .../src/main/scala/SparkApp.scala   |  41 
 dev/audit-release/sbt_app_graphx/build.sbt  |  28 ---
 .../src/main/scala/GraphxApp.scala  |  55 -
 dev/audit-release/sbt_app_hive/build.sbt|  29 ---
 dev/audit-release/sbt_app_hive/data.txt |   9 -
 .../sbt_app_hive/src/main/scala/HiveApp.scala   |  59 -
 dev/audit-release/sbt_app_kinesis/build.sbt |  28 ---
 .../src/main/scala/SparkApp.scala   |  35 ---
 dev/audit-release/sbt_app_sql/build.sbt |  28 ---
 .../sbt_app_sql/src/main/scala/SqlApp.scala |  61 -
 dev/audit-release/sbt_app_streaming/build.sbt   |  28 ---
 .../src/main/scala/StreamingApp.scala   |  65 -
 24 files changed, 1020 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dd784a88/dev/audit-release/.gitignore
--
diff --git a/dev/audit-release/.gitignore b/dev/audit-release/.gitignore
deleted file mode 100644
index 7e057a9..000
--- a/dev/audit-release/.gitignore
+++ /dev/null
@@ -1,2 +0,0 @@
-project/
-spark_audit*

http://git-wip-us.apache.org/repos/asf/spark/blob/dd784a88/dev/audit-release/README.md
--
diff --git a/dev/audit-release/README.md b/dev/audit-release/README.md
deleted file mode 100644
index 37b2a0a..000
--- a/dev/audit-release/README.md
+++ /dev/null
@@ -1,12 +0,0 @@
-Test Application Builds
-===
-
-This directory includes test applications which are built when auditing 
releases. You can run them locally by setting appropriate environment variables.
-
-```
-$ cd sbt_app_core
-$ SCALA_VERSION=2.11.7 \
-  SPARK_VERSION=1.0.0-SNAPSHOT \
-  SPARK_RELEASE_REPOSITORY=file:///home/patrick/.ivy2/local \
-  sbt run
-```

http://git-wip-us.apache.org/repos/asf/spark/blob/dd784a88/dev/audit-release/audit_release.py
--
diff --git a/dev/audit-release/audit_release.py 
b/dev/audit-release/audit_release.py
deleted file mode 100755
index b28e7a4..000
--- a/dev/audit-release/audit_release.py
+++ /dev/null
@@ -1,236 +0,0 @@
-#!/usr/bin/python
-
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-# Audits binary and maven artifacts for a Spark release.
-# Requires GPG and Maven.
-# usage:
-#   python audit_release.py
-
-import os
-import re
-import shutil
-import subprocess
-import sys
-import time

spark git commit: [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6

2016-07-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master b73defdd7 -> ad3708e78


[SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 
1e-6

## What changes were proposed in this pull request?

Replace the ANN convergence tolerance param default of 1e-4 with 1e-6, so that 
it matches the other algorithms in MLlib that use LBFGS as the optimizer.
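
As a hedged illustration (not part of this patch), the tolerance can also be set 
explicitly so user code does not depend on the default; the layer sizes and data 
below are made up:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Sketch only: illustrative layer sizes; `trainingData` is assumed to be a
// DataFrame with "features" and "label" columns.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))
  .setTol(1e-6)    // same value as the new default, set explicitly
  .setMaxIter(100)
// val model = trainer.fit(trainingData)
```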

## How was this patch tested?

Existing Test.

Author: WeichenXu 

Closes #14286 from WeichenXu123/update_ann_tol.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad3708e7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad3708e7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad3708e7

Branch: refs/heads/master
Commit: ad3708e78377d631e3d586548c961f4748322bf0
Parents: b73defd
Author: WeichenXu 
Authored: Mon Jul 25 20:00:37 2016 +0100
Committer: Sean Owen 
Committed: Mon Jul 25 20:00:37 2016 +0100

--
 .../ml/classification/MultilayerPerceptronClassifier.scala   | 4 ++--
 python/pyspark/ml/classification.py  | 8 
 2 files changed, 6 insertions(+), 6 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ad3708e7/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index 76ef32a..7264a99 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -100,7 +100,7 @@ private[classification] trait MultilayerPerceptronParams 
extends PredictorParams
   @Since("2.0.0")
   final def getInitialWeights: Vector = $(initialWeights)
 
-  setDefault(maxIter -> 100, tol -> 1e-4, blockSize -> 128,
+  setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128,
 solver -> MultilayerPerceptronClassifier.LBFGS, stepSize -> 0.03)
 }
 
@@ -190,7 +190,7 @@ class MultilayerPerceptronClassifier @Since("1.5.0") (
   /**
* Set the convergence tolerance of iterations.
* Smaller value will lead to higher accuracy with the cost of more 
iterations.
-   * Default is 1E-4.
+   * Default is 1E-6.
*
* @group setParam
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/ad3708e7/python/pyspark/ml/classification.py
--
diff --git a/python/pyspark/ml/classification.py 
b/python/pyspark/ml/classification.py
index 613bc8c..9a3c7b1 100644
--- a/python/pyspark/ml/classification.py
+++ b/python/pyspark/ml/classification.py
@@ -1124,11 +1124,11 @@ class MultilayerPerceptronClassifier(JavaEstimator, 
HasFeaturesCol, HasLabelCol,
 
 @keyword_only
 def __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
- maxIter=100, tol=1e-4, seed=None, layers=None, blockSize=128, 
stepSize=0.03,
+ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, 
stepSize=0.03,
  solver="l-bfgs", initialWeights=None):
 """
 __init__(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
- maxIter=100, tol=1e-4, seed=None, layers=None, blockSize=128, 
stepSize=0.03, \
+ maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, 
stepSize=0.03, \
  solver="l-bfgs", initialWeights=None)
 """
 super(MultilayerPerceptronClassifier, self).__init__()
@@ -1141,11 +1141,11 @@ class MultilayerPerceptronClassifier(JavaEstimator, 
HasFeaturesCol, HasLabelCol,
 @keyword_only
 @since("1.6.0")
 def setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction",
-  maxIter=100, tol=1e-4, seed=None, layers=None, 
blockSize=128, stepSize=0.03,
+  maxIter=100, tol=1e-6, seed=None, layers=None, 
blockSize=128, stepSize=0.03,
   solver="l-bfgs", initialWeights=None):
 """
 setParams(self, featuresCol="features", labelCol="label", 
predictionCol="prediction", \
-  maxIter=100, tol=1e-4, seed=None, layers=None, 
blockSize=128, stepSize=0.03, \
+  maxIter=100, tol=1e-6, seed=None, layers=None, 
blockSize=128, stepSize=0.03, \
   solver="l-bfgs", initialWeights=None)
 Sets params for MultilayerPerceptronClassifier.
 """


-

spark git commit: [SPARK-15271][MESOS] Allow force pulling executor docker images

2016-07-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master dd784a882 -> 978cd5f12


[SPARK-15271][MESOS] Allow force pulling executor docker images

## What changes were proposed in this pull request?

By default, Mesos agents do not pull Docker images that are already cached 
locally. In order to run Spark executors from mutable tags like `:latest`, 
this commit introduces a Spark setting, 
`spark.mesos.executor.docker.forcePullImage`. Setting this flag to `true` 
tells the Mesos agent to force pull the Docker image (the default is `false`, 
which is consistent with the previous implementation and Mesos' default 
behaviour).
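
As a hedged sketch (not taken from the patch itself), the flag would be set like 
any other Spark configuration; the executor image name below is illustrative:

```scala
import org.apache.spark.SparkConf

// Sketch only: assumes a Mesos deployment already configured to run executors
// in Docker.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "myrepo/spark-executor:latest") // illustrative image
  .set("spark.mesos.executor.docker.forcePullImage", "true") // ask the agent to re-pull the tag
```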

## How was this patch tested?

I ran a sample application including this change on a Mesos cluster and 
verified the correct behaviour both with and without force pulling the 
executor image. As expected, the image is force pulled when the flag is set.

Author: Philipp Hoffmann 

Closes #13051 from philipphoffmann/force-pull-image.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/978cd5f1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/978cd5f1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/978cd5f1

Branch: refs/heads/master
Commit: 978cd5f125eb5a410bad2e60bf8385b11cf1b978
Parents: dd784a8
Author: Philipp Hoffmann 
Authored: Mon Jul 25 20:14:47 2016 +0100
Committer: Sean Owen 
Committed: Mon Jul 25 20:14:47 2016 +0100

--
 .../cluster/mesos/MesosClusterScheduler.scala   | 14 ++---
 .../MesosCoarseGrainedSchedulerBackend.scala|  7 ++-
 .../MesosFineGrainedSchedulerBackend.scala  |  7 ++-
 .../mesos/MesosSchedulerBackendUtil.scala   | 20 ---
 ...esosCoarseGrainedSchedulerBackendSuite.scala | 63 
 .../MesosFineGrainedSchedulerBackendSuite.scala |  2 +
 dev/deps/spark-deps-hadoop-2.2  |  2 +-
 dev/deps/spark-deps-hadoop-2.3  |  2 +-
 dev/deps/spark-deps-hadoop-2.4  |  2 +-
 dev/deps/spark-deps-hadoop-2.6  |  2 +-
 dev/deps/spark-deps-hadoop-2.7  |  2 +-
 docs/_config.yml|  2 +-
 docs/running-on-mesos.md| 12 
 pom.xml |  2 +-
 14 files changed, 110 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/978cd5f1/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
index 39b0f4d..1e9644d 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
@@ -537,16 +537,10 @@ private[spark] class MesosClusterScheduler(
   .addAllResources(memResourcesToUse.asJava)
 offer.resources = finalResources.asJava
 
submission.schedulerProperties.get("spark.mesos.executor.docker.image").foreach 
{ image =>
-  val container = taskInfo.getContainerBuilder()
-  val volumes = submission.schedulerProperties
-.get("spark.mesos.executor.docker.volumes")
-.map(MesosSchedulerBackendUtil.parseVolumesSpec)
-  val portmaps = submission.schedulerProperties
-.get("spark.mesos.executor.docker.portmaps")
-.map(MesosSchedulerBackendUtil.parsePortMappingsSpec)
-  MesosSchedulerBackendUtil.addDockerInfo(
-container, image, volumes = volumes, portmaps = portmaps)
-  taskInfo.setContainer(container.build())
+  MesosSchedulerBackendUtil.setupContainerBuilderDockerInfo(
+image,
+submission.schedulerProperties.get,
+taskInfo.getContainerBuilder())
 }
 val queuedTasks = tasks.getOrElseUpdate(offer.offerId, new 
ArrayBuffer[TaskInfo])
 queuedTasks += taskInfo.build()

http://git-wip-us.apache.org/repos/asf/spark/blob/978cd5f1/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
index 99e6d39..52993ca 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend

spark git commit: [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc

2016-07-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 978cd5f12 -> 3b6e1d094


[SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc

## What changes were proposed in this pull request?

Fixed several inline formatting in ml features doc.

Before:

https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png

After:

https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png

## How was this patch tested?

Generate the docs locally with `SKIP_API=1 jekyll build` and view them in the 
browser.

Author: Shuai Lin 

Closes #14194 from lins05/fix-docs-formatting.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b6e1d09
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b6e1d09
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b6e1d09

Branch: refs/heads/master
Commit: 3b6e1d094e153599e158331b10d33d74a667be5a
Parents: 978cd5f
Author: Shuai Lin 
Authored: Mon Jul 25 20:26:55 2016 +0100
Committer: Sean Owen 
Committed: Mon Jul 25 20:26:55 2016 +0100

--
 docs/ml-features.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3b6e1d09/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index e7d7ddf..6020114 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -216,7 +216,7 @@ for more details on the API.
 
 
[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer)
 allows more
  advanced tokenization based on regular expression (regex) matching.
- By default, the parameter "pattern" (regex, default: \\s+) is used as 
delimiters to split the input text.
+ By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as 
delimiters to split the input text.
  Alternatively, users can set parameter "gaps" to false indicating the regex 
"pattern" denotes
  "tokens" rather than splitting gaps, and find all matching occurrences as the 
tokenization result.
 
@@ -815,7 +815,7 @@ The rescaled value for a feature E is calculated as,
 `\begin{equation}
   Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
 \end{equation}`
-For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)`
+For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$`
 
 Note that since zero values will probably be transformed to non-zero values, 
output of the transformer will be `DenseVector` even for sparse input.
 


-



spark git commit: [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc

2016-07-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 d9bd066b9 -> f0d05f669


[SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc

## What changes were proposed in this pull request?

Fixed several inline formatting in ml features doc.

Before:

https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png

After:

https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png

## How was this patch tested?

Generate the docs locally with `SKIP_API=1 jekyll build` and view them in the 
browser.

Author: Shuai Lin 

Closes #14194 from lins05/fix-docs-formatting.

(cherry picked from commit 3b6e1d094e153599e158331b10d33d74a667be5a)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0d05f66
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0d05f66
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0d05f66

Branch: refs/heads/branch-2.0
Commit: f0d05f669b4e7be017d8d0cfba33c3a61a1eef8f
Parents: d9bd066
Author: Shuai Lin 
Authored: Mon Jul 25 20:26:55 2016 +0100
Committer: Sean Owen 
Committed: Mon Jul 25 20:27:04 2016 +0100

--
 docs/ml-features.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f0d05f66/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index e7d7ddf..6020114 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -216,7 +216,7 @@ for more details on the API.
 
 
[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer)
 allows more
  advanced tokenization based on regular expression (regex) matching.
- By default, the parameter "pattern" (regex, default: \\s+) is used as 
delimiters to split the input text.
+ By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as 
delimiters to split the input text.
  Alternatively, users can set parameter "gaps" to false indicating the regex 
"pattern" denotes
  "tokens" rather than splitting gaps, and find all matching occurrences as the 
tokenization result.
 
@@ -815,7 +815,7 @@ The rescaled value for a feature E is calculated as,
 `\begin{equation}
   Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
 \end{equation}`
-For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)`
+For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$`
 
 Note that since zero values will probably be transformed to non-zero values, 
output of the transformer will be `DenseVector` even for sparse input.
 


-



spark git commit: Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"

2016-07-25 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master 3b6e1d094 -> fc17121d5


Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"

This reverts commit 978cd5f125eb5a410bad2e60bf8385b11cf1b978.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc17121d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc17121d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc17121d

Branch: refs/heads/master
Commit: fc17121d592acbd7405135cd576bafc5c574650e
Parents: 3b6e1d0
Author: Josh Rosen 
Authored: Mon Jul 25 12:43:44 2016 -0700
Committer: Josh Rosen 
Committed: Mon Jul 25 12:43:44 2016 -0700

--
 .../cluster/mesos/MesosClusterScheduler.scala   | 14 +++--
 .../MesosCoarseGrainedSchedulerBackend.scala|  7 +--
 .../MesosFineGrainedSchedulerBackend.scala  |  7 +--
 .../mesos/MesosSchedulerBackendUtil.scala   | 20 +++
 ...esosCoarseGrainedSchedulerBackendSuite.scala | 63 
 .../MesosFineGrainedSchedulerBackendSuite.scala |  2 -
 dev/deps/spark-deps-hadoop-2.2  |  2 +-
 dev/deps/spark-deps-hadoop-2.3  |  2 +-
 dev/deps/spark-deps-hadoop-2.4  |  2 +-
 dev/deps/spark-deps-hadoop-2.6  |  2 +-
 dev/deps/spark-deps-hadoop-2.7  |  2 +-
 docs/_config.yml|  2 +-
 docs/running-on-mesos.md| 12 
 pom.xml |  2 +-
 14 files changed, 29 insertions(+), 110 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fc17121d/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
index 1e9644d..39b0f4d 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala
@@ -537,10 +537,16 @@ private[spark] class MesosClusterScheduler(
   .addAllResources(memResourcesToUse.asJava)
 offer.resources = finalResources.asJava
 
submission.schedulerProperties.get("spark.mesos.executor.docker.image").foreach 
{ image =>
-  MesosSchedulerBackendUtil.setupContainerBuilderDockerInfo(
-image,
-submission.schedulerProperties.get,
-taskInfo.getContainerBuilder())
+  val container = taskInfo.getContainerBuilder()
+  val volumes = submission.schedulerProperties
+.get("spark.mesos.executor.docker.volumes")
+.map(MesosSchedulerBackendUtil.parseVolumesSpec)
+  val portmaps = submission.schedulerProperties
+.get("spark.mesos.executor.docker.portmaps")
+.map(MesosSchedulerBackendUtil.parsePortMappingsSpec)
+  MesosSchedulerBackendUtil.addDockerInfo(
+container, image, volumes = volumes, portmaps = portmaps)
+  taskInfo.setContainer(container.build())
 }
 val queuedTasks = tasks.getOrElseUpdate(offer.offerId, new 
ArrayBuffer[TaskInfo])
 queuedTasks += taskInfo.build()

http://git-wip-us.apache.org/repos/asf/spark/blob/fc17121d/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
index 52993ca..99e6d39 100644
--- 
a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
+++ 
b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
@@ -408,11 +408,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
 .addAllResources(memResourcesToUse.asJava)
 
   sc.conf.getOption("spark.mesos.executor.docker.image").foreach { 
image =>
-MesosSchedulerBackendUtil.setupContainerBuilderDockerInfo(
-  image,
-  sc.conf.getOption,
-  taskBuilder.getContainerBuilder
-)
+MesosSchedulerBackendUtil
+  .setupContainerBuilderDockerInfo(image, sc.conf, 
taskBuilder.getContainerBuilder)
   }
 
   tasks(offer.getId) ::= taskBuilder.build()

http://git-wip-us.apache.org/repos/asf/spark/blob/fc17121d/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosFine

spark git commit: [SQL][DOC] Fix a default name for parquet compression

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master fc17121d5 -> cda4603de


[SQL][DOC] Fix a default name for parquet compression

## What changes were proposed in this pull request?
This PR fixes a wrong description of the default Parquet compression codec in 
the SQL programming guide: the default is `snappy`, not `gzip`.
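
For reference, a minimal hedged sketch of setting the codec explicitly instead of 
relying on the documented default; the app name and output path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: pick the Parquet codec explicitly so the default never matters.
val spark = SparkSession.builder().appName("parquet-codec-example").getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.range(10).write.parquet("/tmp/example-parquet") // illustrative output path
```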

Author: Takeshi YAMAMURO 

Closes #14351 from maropu/FixParquetDoc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cda4603d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cda4603d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cda4603d

Branch: refs/heads/master
Commit: cda4603de340d533c49feac1b244ddfd291f9bcf
Parents: fc17121
Author: Takeshi YAMAMURO 
Authored: Mon Jul 25 15:08:58 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 15:08:58 2016 -0700

--
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cda4603d/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index ad123d7..d8c8698 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -749,7 +749,7 @@ Configuration of Parquet can be done using the `setConf` 
method on `SparkSession
 
 
   spark.sql.parquet.compression.codec
-  gzip
+  snappy
   
 Sets the compression codec use when writing Parquet files. Acceptable 
values include:
 uncompressed, snappy, gzip, lzo.


-



spark git commit: [SQL][DOC] Fix a default name for parquet compression

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f0d05f669 -> 1b4f7cf13


[SQL][DOC] Fix a default name for parquet compression

## What changes were proposed in this pull request?
This PR fixes a wrong description of the default Parquet compression codec in 
the SQL programming guide: the default is `snappy`, not `gzip`.

Author: Takeshi YAMAMURO 

Closes #14351 from maropu/FixParquetDoc.

(cherry picked from commit cda4603de340d533c49feac1b244ddfd291f9bcf)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b4f7cf1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b4f7cf1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b4f7cf1

Branch: refs/heads/branch-2.0
Commit: 1b4f7cf135eebc46f07649509a027b6d422dcfdf
Parents: f0d05f6
Author: Takeshi YAMAMURO 
Authored: Mon Jul 25 15:08:58 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 15:09:04 2016 -0700

--
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1b4f7cf1/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e92596b..33b170e 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -749,7 +749,7 @@ Configuration of Parquet can be done using the `setConf` 
method on `SparkSession
 
 
   spark.sql.parquet.compression.codec
-  gzip
+  snappy
   
 Sets the compression codec use when writing Parquet files. Acceptable 
values include:
 uncompressed, snappy, gzip, lzo.


-



spark git commit: [SPARK-16166][CORE] Also take off-heap memory usage into consideration in log and webui display

2016-07-25 Thread joshrosen
Repository: spark
Updated Branches:
  refs/heads/master cda4603de -> f5ea7fe53


[SPARK-16166][CORE] Also take off-heap memory usage into consideration in log 
and webui display

## What changes were proposed in this pull request?

Currently in the log and UI display, only on-heap storage memory is calculated 
and displayed,

```
16/06/27 13:41:52 INFO MemoryStore: Block rdd_5_0 stored as values in memory 
(estimated size 17.8 KB, free 665.9 MB)
```
https://cloud.githubusercontent.com/assets/850797/16369960/53fb614e-3c6e-11e6-8fa3-7ffe65abcb49.png

With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), off-heap 
memory is supported for data persistence, so this change also takes off-heap 
storage memory into consideration.
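
As a hedged sketch (not part of the patch), off-heap storage only shows up once 
it is enabled via configuration; the size below is illustrative:

```scala
import org.apache.spark.SparkConf

// Sketch only: enable off-heap memory so the storage figures reported in the
// log and UI include it.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g") // illustrative size
```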

## How was this patch tested?

Unit test and local verification.

Author: jerryshao 

Closes #13920 from jerryshao/SPARK-16166.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea7fe5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea7fe5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea7fe5

Branch: refs/heads/master
Commit: f5ea7fe53974a7e8cbfc222b9a6f47669b53ccfd
Parents: cda4603
Author: jerryshao 
Authored: Mon Jul 25 15:17:06 2016 -0700
Committer: Josh Rosen 
Committed: Mon Jul 25 15:17:06 2016 -0700

--
 .../scala/org/apache/spark/memory/MemoryManager.scala | 10 --
 .../org/apache/spark/memory/StaticMemoryManager.scala |  2 ++
 .../org/apache/spark/memory/UnifiedMemoryManager.scala|  4 
 .../scala/org/apache/spark/storage/BlockManager.scala |  5 +++--
 .../org/apache/spark/storage/memory/MemoryStore.scala |  4 +++-
 .../scala/org/apache/spark/memory/TestMemoryManager.scala |  2 ++
 .../org/apache/spark/storage/BlockManagerSuite.scala  |  8 
 7 files changed, 26 insertions(+), 9 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea7fe5/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala
--
diff --git a/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala 
b/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala
index 0210217..82442cf 100644
--- a/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala
+++ b/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala
@@ -62,13 +62,19 @@ private[spark] abstract class MemoryManager(
   offHeapStorageMemoryPool.incrementPoolSize(offHeapStorageMemory)
 
   /**
-   * Total available memory for storage, in bytes. This amount can vary over 
time, depending on
-   * the MemoryManager implementation.
+   * Total available on heap memory for storage, in bytes. This amount can 
vary over time,
+   * depending on the MemoryManager implementation.
* In this model, this is equivalent to the amount of memory not occupied by 
execution.
*/
   def maxOnHeapStorageMemory: Long
 
   /**
+   * Total available off heap memory for storage, in bytes. This amount can 
vary over time,
+   * depending on the MemoryManager implementation.
+   */
+  def maxOffHeapStorageMemory: Long
+
+  /**
* Set the [[MemoryStore]] used by this manager to evict cached blocks.
* This must be set after construction due to initialization ordering 
constraints.
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea7fe5/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala 
b/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala
index 08155aa..a6f7db0 100644
--- a/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala
+++ b/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala
@@ -55,6 +55,8 @@ private[spark] class StaticMemoryManager(
 (maxOnHeapStorageMemory * conf.getDouble("spark.storage.unrollFraction", 
0.2)).toLong
   }
 
+  override def maxOffHeapStorageMemory: Long = 0L
+
   override def acquireStorageMemory(
   blockId: BlockId,
   numBytes: Long,

http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea7fe5/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala 
b/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala
index c7b36be..fea2808 100644
--- a/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala
+++ b/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala
@@ -67,6 +67,10 @@ private[spark] class UnifiedMemoryManager private[memory] (
 maxHeapMemory - onHeapExecutionMemoryPo

spark git commit: [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"

2016-07-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master f5ea7fe53 -> 12f490b5c


[SPARK-16715][TESTS] Fix a potential ExprId conflict for 
SubexpressionEliminationSuite."Semantic equals and hash"

## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes that the default 
AttributeReference's exprId won't be "ExprId(1)". However, that depends on when 
this test runs: it may happen to be "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when 
the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #14350 from zsxwing/SPARK-16715.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12f490b5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12f490b5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12f490b5

Branch: refs/heads/master
Commit: 12f490b5c85cdee26d47eb70ad1a1edd00504f21
Parents: f5ea7fe
Author: Shixiong Zhu 
Authored: Mon Jul 25 16:08:29 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Jul 25 16:08:29 2016 -0700

--
 .../catalyst/expressions/SubexpressionEliminationSuite.scala   | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/12f490b5/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
index 90e97d7..1e39b24 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
@@ -21,8 +21,12 @@ import org.apache.spark.sql.types.IntegerType
 
 class SubexpressionEliminationSuite extends SparkFunSuite {
   test("Semantic equals and hash") {
-val id = ExprId(1)
 val a: AttributeReference = AttributeReference("name", IntegerType)()
+val id = {
+  // Make sure we use a "ExprId" different from "a.exprId"
+  val _id = ExprId(1)
+  if (a.exprId == _id) ExprId(2) else _id
+}
 val b1 = a.withName("name2").withExprId(id)
 val b2 = a.withExprId(id)
 val b3 = a.withQualifier(Some("qualifierName"))


-



spark git commit: [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"

2016-07-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 1b4f7cf13 -> 41e72f659


[SPARK-16715][TESTS] Fix a potential ExprId conflict for 
SubexpressionEliminationSuite."Semantic equals and hash"

## What changes were proposed in this pull request?

SubexpressionEliminationSuite."Semantic equals and hash" assumes that the default 
AttributeReference's exprId won't be "ExprId(1)". However, that depends on when 
this test runs: it may happen to be "ExprId(1)".

This PR detects the conflict and makes sure we create a different ExprId when 
the conflict happens.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu 

Closes #14350 from zsxwing/SPARK-16715.

(cherry picked from commit 12f490b5c85cdee26d47eb70ad1a1edd00504f21)
Signed-off-by: Shixiong Zhu 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e72f65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e72f65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e72f65

Branch: refs/heads/branch-2.0
Commit: 41e72f65929c345aa21ebd4e00dadfbfb5acfdf3
Parents: 1b4f7cf
Author: Shixiong Zhu 
Authored: Mon Jul 25 16:08:29 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Jul 25 16:08:36 2016 -0700

--
 .../catalyst/expressions/SubexpressionEliminationSuite.scala   | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/41e72f65/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
index 90e97d7..1e39b24 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala
@@ -21,8 +21,12 @@ import org.apache.spark.sql.types.IntegerType
 
 class SubexpressionEliminationSuite extends SparkFunSuite {
   test("Semantic equals and hash") {
-val id = ExprId(1)
 val a: AttributeReference = AttributeReference("name", IntegerType)()
+val id = {
+  // Make sure we use a "ExprId" different from "a.exprId"
+  val _id = ExprId(1)
+  if (a.exprId == _id) ExprId(2) else _id
+}
 val b1 = a.withName("name2").withExprId(id)
 val b2 = a.withExprId(id)
 val b3 = a.withQualifier(Some("qualifierName"))


-



spark git commit: [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog

2016-07-25 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master 12f490b5c -> c979c8bba


[SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in 
HDFSMetadataLog

## What changes were proposed in this pull request?
The current fix for the deadlock disables interrupts in StreamExecution while 
getting offsets from all sources and while writing to any metadata log, to avoid 
potential deadlocks in HDFSMetadataLog (see the JIRA for more details). However, 
disabling interrupts can have unintended consequences in other sources, so this 
patch narrows the fix by disabling interrupts only in HDFSMetadataLog. This is a 
narrower fix for something as risky as disabling interrupts.
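
A minimal hedged sketch of the pattern the fix relies on (mirroring the diff 
below; `UninterruptibleThread` is a Spark-internal utility and the helper name is 
illustrative): the caller must already be an `UninterruptibleThread`, and the 
critical write runs with interrupts disabled.

```scala
import org.apache.spark.util.UninterruptibleThread

// Sketch only: `write` stands in for the HDFS batch-file write that must not be
// interrupted while it runs.
def addBatchUninterruptibly(write: () => Unit): Unit = Thread.currentThread match {
  case ut: UninterruptibleThread => ut.runUninterruptibly { write() }
  case _ =>
    throw new IllegalStateException("must be called on an UninterruptibleThread")
}
```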

## How was this patch tested?
Existing tests.

Author: Tathagata Das 

Closes #14292 from tdas/SPARK-14131.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c979c8bb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c979c8bb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c979c8bb

Branch: refs/heads/master
Commit: c979c8bba02bc89cb9ad81b212f085a8a5490a07
Parents: 12f490b
Author: Tathagata Das 
Authored: Mon Jul 25 16:09:22 2016 -0700
Committer: Tathagata Das 
Committed: Mon Jul 25 16:09:22 2016 -0700

--
 .../execution/streaming/HDFSMetadataLog.scala   | 31 ++
 .../execution/streaming/StreamExecution.scala   | 28 -
 .../streaming/FileStreamSinkLogSuite.scala  |  4 +-
 .../streaming/HDFSMetadataLogSuite.scala| 10 +++--
 .../apache/spark/sql/test/SQLTestUtils.scala| 43 +++-
 5 files changed, 80 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c979c8bb/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
index 069e41b..698f07b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
@@ -32,6 +32,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.network.util.JavaUtils
 import org.apache.spark.serializer.JavaSerializer
 import org.apache.spark.sql.SparkSession
+import org.apache.spark.util.UninterruptibleThread
 
 
 /**
@@ -91,18 +92,30 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
 serializer.deserialize[T](ByteBuffer.wrap(bytes))
   }
 
+  /**
+   * Store the metadata for the specified batchId and return `true` if 
successful. If the batchId's
+   * metadata has already been stored, this method will return `false`.
+   *
+   * Note that this method must be called on a 
[[org.apache.spark.util.UninterruptibleThread]]
+   * so that interrupts can be disabled while writing the batch file. This is 
because there is a
+   * potential dead-lock in Hadoop "Shell.runCommand" before 2.5.0 
(HADOOP-10622). If the thread
+   * running "Shell.runCommand" is interrupted, then the thread can get 
deadlocked. In our
+   * case, `writeBatch` creates a file using HDFS API and calls 
"Shell.runCommand" to set the
+   * file permissions, and can get deadlocked if the stream execution thread 
is stopped by
+   * interrupt. Hence, we make sure that this method is called on 
[[UninterruptibleThread]] which
+   * allows us to disable interrupts here. Also see SPARK-14131.
+   */
   override def add(batchId: Long, metadata: T): Boolean = {
 get(batchId).map(_ => false).getOrElse {
-  // Only write metadata when the batch has not yet been written.
-  try {
-writeBatch(batchId, serialize(metadata))
-true
-  } catch {
-case e: IOException if "java.lang.InterruptedException" == 
e.getMessage =>
-  // create may convert InterruptedException to IOException. Let's 
convert it back to
-  // InterruptedException so that this failure won't crash 
StreamExecution
-  throw new InterruptedException("Creating file is interrupted")
+  // Only write metadata when the batch has not yet been written
+  Thread.currentThread match {
+case ut: UninterruptibleThread =>
+  ut.runUninterruptibly { writeBatch(batchId, serialize(metadata)) }
+case _ =>
+  throw new IllegalStateException(
+"HDFSMetadataLog.add() must be executed on a 
o.a.spark.util.UninterruptibleThread")
   }
+  true
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/c979c8bb/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
--

spark git commit: [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog

2016-07-25 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 41e72f659 -> b17fe4e41


[SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in 
HDFSMetadataLog

## What changes were proposed in this pull request?
The current fix for the deadlock disables interrupts in StreamExecution while 
getting offsets from all sources and while writing to any metadata log, to avoid 
potential deadlocks in HDFSMetadataLog (see the JIRA for more details). However, 
disabling interrupts can have unintended consequences in other sources, so this 
patch narrows the fix by disabling interrupts only in HDFSMetadataLog. This is a 
narrower fix for something as risky as disabling interrupts.

## How was this patch tested?
Existing tests.

Author: Tathagata Das 

Closes #14292 from tdas/SPARK-14131.

(cherry picked from commit c979c8bba02bc89cb9ad81b212f085a8a5490a07)
Signed-off-by: Tathagata Das 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b17fe4e4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b17fe4e4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b17fe4e4

Branch: refs/heads/branch-2.0
Commit: b17fe4e412d27a4f3e8ad86ac5d8c2c108654eb3
Parents: 41e72f6
Author: Tathagata Das 
Authored: Mon Jul 25 16:09:22 2016 -0700
Committer: Tathagata Das 
Committed: Mon Jul 25 16:09:35 2016 -0700

--
 .../execution/streaming/HDFSMetadataLog.scala   | 31 ++
 .../execution/streaming/StreamExecution.scala   | 28 -
 .../streaming/FileStreamSinkLogSuite.scala  |  4 +-
 .../streaming/HDFSMetadataLogSuite.scala| 10 +++--
 .../apache/spark/sql/test/SQLTestUtils.scala| 43 +++-
 5 files changed, 80 insertions(+), 36 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b17fe4e4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
index 069e41b..698f07b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
@@ -32,6 +32,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.network.util.JavaUtils
 import org.apache.spark.serializer.JavaSerializer
 import org.apache.spark.sql.SparkSession
+import org.apache.spark.util.UninterruptibleThread
 
 
 /**
@@ -91,18 +92,30 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
 serializer.deserialize[T](ByteBuffer.wrap(bytes))
   }
 
+  /**
+   * Store the metadata for the specified batchId and return `true` if 
successful. If the batchId's
+   * metadata has already been stored, this method will return `false`.
+   *
+   * Note that this method must be called on a 
[[org.apache.spark.util.UninterruptibleThread]]
+   * so that interrupts can be disabled while writing the batch file. This is 
because there is a
+   * potential dead-lock in Hadoop "Shell.runCommand" before 2.5.0 
(HADOOP-10622). If the thread
+   * running "Shell.runCommand" is interrupted, then the thread can get 
deadlocked. In our
+   * case, `writeBatch` creates a file using HDFS API and calls 
"Shell.runCommand" to set the
+   * file permissions, and can get deadlocked if the stream execution thread 
is stopped by
+   * interrupt. Hence, we make sure that this method is called on 
[[UninterruptibleThread]] which
+   * allows us to disable interrupts here. Also see SPARK-14131.
+   */
   override def add(batchId: Long, metadata: T): Boolean = {
 get(batchId).map(_ => false).getOrElse {
-  // Only write metadata when the batch has not yet been written.
-  try {
-writeBatch(batchId, serialize(metadata))
-true
-  } catch {
-case e: IOException if "java.lang.InterruptedException" == 
e.getMessage =>
-  // create may convert InterruptedException to IOException. Let's 
convert it back to
-  // InterruptedException so that this failure won't crash 
StreamExecution
-  throw new InterruptedException("Creating file is interrupted")
+  // Only write metadata when the batch has not yet been written
+  Thread.currentThread match {
+case ut: UninterruptibleThread =>
+  ut.runUninterruptibly { writeBatch(batchId, serialize(metadata)) }
+case _ =>
+  throw new IllegalStateException(
+"HDFSMetadataLog.add() must be executed on a 
o.a.spark.util.UninterruptibleThread")
   }
+  true
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b1

spark git commit: [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab

2016-07-25 Thread zsxwing
Repository: spark
Updated Branches:
  refs/heads/master c979c8bba -> db36e1e75


[SPARK-15590][WEBUI] Paginate Job Table in Jobs tab

## What changes were proposed in this pull request?

This patch adds pagination support for the Job Tables in the Jobs tab. 
Pagination is provided for all three Job Tables (active, completed, and 
failed). Interactions (jumping, sorting, and setting page size) with the paged 
tables are also included.

The diff does not track some lines against the original ones. The function 
`makeRow` of the original `AllJobsPage.scala` is reused; its logic now appears 
at the beginning of the function `jobRow` (L427-439) and in the function 
`row` (L594-618) of the new `AllJobsPage.scala`.

## How was this patch tested?

Tested manually by checking the Web UI after completing and failing hundreds 
of jobs.
Generate completed jobs by:
```scala
val d = sc.parallelize(Array(1,2,3,4,5))
for(i <- 1 to 255){ var b = d.collect() }
```
Generate failed jobs by calling the following code multiple times:
```scala
var b = d.map(_/0).collect()
```
Interactions like jumping, sorting, and setting page size are all tested.

This shows the pagination for completed jobs:
![paginate success 
jobs](https://cloud.githubusercontent.com/assets/5558370/15986498/efa12ef6-303b-11e6-8b1d-c3382aeb9ad0.png)

This shows the sorting works in job tables:
![sorting](https://cloud.githubusercontent.com/assets/5558370/15986539/98c8a81a-303c-11e6-86f2-8d2bc7924ee9.png)

This shows the pagination for failed jobs and the effect of jumping and setting 
page size:
![paginate failed 
jobs](https://cloud.githubusercontent.com/assets/5558370/15986556/d8c1323e-303c-11e6-8e4b-7bdb030ea42b.png)

Author: Tao Lin 

Closes #13620 from nblintao/dev.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db36e1e7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db36e1e7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db36e1e7

Branch: refs/heads/master
Commit: db36e1e75d69d63b76312e85ae3a6c95cebbe65e
Parents: c979c8b
Author: Tao Lin 
Authored: Mon Jul 25 17:35:50 2016 -0700
Committer: Shixiong Zhu 
Committed: Mon Jul 25 17:35:50 2016 -0700

--
 .../org/apache/spark/ui/jobs/AllJobsPage.scala  | 369 ---
 .../org/apache/spark/ui/UISeleniumSuite.scala   |   5 +-
 2 files changed, 312 insertions(+), 62 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/db36e1e7/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
--
diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala 
b/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
index 035d706..e5363ce 100644
--- a/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala
@@ -17,17 +17,21 @@
 
 package org.apache.spark.ui.jobs
 
+import java.net.URLEncoder
 import java.util.Date
 import javax.servlet.http.HttpServletRequest
 
+import scala.collection.JavaConverters._
 import scala.collection.mutable.{HashMap, ListBuffer}
 import scala.xml._
 
 import org.apache.commons.lang3.StringEscapeUtils
 
 import org.apache.spark.JobExecutionStatus
-import org.apache.spark.ui.{ToolTips, UIUtils, WebUIPage}
-import org.apache.spark.ui.jobs.UIData.{ExecutorUIData, JobUIData}
+import org.apache.spark.scheduler.StageInfo
+import org.apache.spark.ui._
+import org.apache.spark.ui.jobs.UIData.{ExecutorUIData, JobUIData, StageUIData}
+import org.apache.spark.util.Utils
 
 /** Page showing list of all ongoing and recently finished jobs */
 private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage("") {
@@ -210,64 +214,69 @@ private[ui] class AllJobsPage(parent: JobsTab) extends 
WebUIPage("") {
 
   }
 
-  private def jobsTable(jobs: Seq[JobUIData]): Seq[Node] = {
+  private def jobsTable(
+  request: HttpServletRequest,
+  jobTag: String,
+  jobs: Seq[JobUIData]): Seq[Node] = {
+val allParameters = request.getParameterMap.asScala.toMap
+val parameterOtherTable = allParameters.filterNot(_._1.startsWith(jobTag))
+  .map(para => para._1 + "=" + para._2(0))
+
 val someJobHasJobGroup = jobs.exists(_.jobGroup.isDefined)
+val jobIdTitle = if (someJobHasJobGroup) "Job Id (Job Group)" else "Job Id"
 
-val columns: Seq[Node] = {
-  {if (someJobHasJobGroup) "Job Id (Job Group)" else "Job Id"}
-  Description
-  Submitted
-  Duration
-  Stages: Succeeded/Total
-  Tasks (for all stages): Succeeded/Total
-}
+val parameterJobPage = request.getParameter(jobTag + ".page")
+val parameterJobSortColumn = request.getParameter(jobTag + ".sort")
+val parameterJobSortDesc = request.getParameter(jobTag + ".desc")
+val pa

spark git commit: [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails

2016-07-25 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 b17fe4e41 -> 9d581dc61


[SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when 
eventually fails

## What changes were proposed in this pull request?

This PR moves `ssc.stop()` into a `finally` block in 
`StreamingContextSuite.createValidCheckpoint` to avoid leaking a 
StreamingContext, since a leaked StreamingContext fails a lot of other tests 
and makes it hard to find the real failing one.
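
The general pattern, as a hedged sketch with illustrative names (not the suite's 
actual code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: stop the context in `finally` so a failed assertion cannot leak it.
val conf = new SparkConf().setMaster("local[2]").setAppName("leak-free-test")
val ssc = new StreamingContext(conf, Seconds(1))
try {
  // set up streams, start the context, and run the (possibly failing) checks here
} finally {
  ssc.stop()
}
```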

## How was this patch tested?

Jenkins unit tests

Author: Shixiong Zhu 

Closes #14354 from zsxwing/ssc-leak.

(cherry picked from commit e164a04b2ba3503e5c14cd1cd4beb40e0b79925a)
Signed-off-by: Tathagata Das 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d581dc6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d581dc6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d581dc6

Branch: refs/heads/branch-2.0
Commit: 9d581dc61951eccf0f06868e0d3f10134f433e82
Parents: b17fe4e
Author: Shixiong Zhu 
Authored: Mon Jul 25 18:26:29 2016 -0700
Committer: Tathagata Das 
Committed: Mon Jul 25 18:26:37 2016 -0700

--
 .../org/apache/spark/streaming/StreamingContextSuite.scala  | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9d581dc6/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
 
b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
index 806e181..f1482e5 100644
--- 
a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
+++ 
b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
@@ -819,10 +819,13 @@ class StreamingContextSuite extends SparkFunSuite with 
BeforeAndAfter with Timeo
 ssc.checkpoint(checkpointDirectory)
 ssc.textFileStream(testDirectory).foreachRDD { rdd => rdd.count() }
 ssc.start()
-eventually(timeout(1 millis)) {
-  assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1)
+try {
+  eventually(timeout(3 millis)) {
+assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1)
+  }
+} finally {
+  ssc.stop()
 }
-ssc.stop()
 checkpointDirectory
   }
 


-



spark git commit: [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails

2016-07-25 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master db36e1e75 -> e164a04b2


[SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when 
eventually fails

## What changes were proposed in this pull request?

This PR moves `ssc.stop()` into a `finally` block in 
`StreamingContextSuite.createValidCheckpoint` to avoid leaking a 
StreamingContext, since a leaked StreamingContext fails a lot of other tests 
and makes it hard to find the real failing one.

## How was this patch tested?

Jenkins unit tests

Author: Shixiong Zhu 

Closes #14354 from zsxwing/ssc-leak.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e164a04b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e164a04b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e164a04b

Branch: refs/heads/master
Commit: e164a04b2ba3503e5c14cd1cd4beb40e0b79925a
Parents: db36e1e
Author: Shixiong Zhu 
Authored: Mon Jul 25 18:26:29 2016 -0700
Committer: Tathagata Das 
Committed: Mon Jul 25 18:26:29 2016 -0700

--
 .../org/apache/spark/streaming/StreamingContextSuite.scala  | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e164a04b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
 
b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
index 806e181..f1482e5 100644
--- 
a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
+++ 
b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala
@@ -819,10 +819,13 @@ class StreamingContextSuite extends SparkFunSuite with 
BeforeAndAfter with Timeo
 ssc.checkpoint(checkpointDirectory)
 ssc.textFileStream(testDirectory).foreachRDD { rdd => rdd.count() }
 ssc.start()
-eventually(timeout(1 millis)) {
-  assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1)
+try {
+  eventually(timeout(3 millis)) {
+assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1)
+  }
+} finally {
+  ssc.stop()
 }
-ssc.stop()
 checkpointDirectory
   }
 


-



spark git commit: [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs

2016-07-25 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master e164a04b2 -> 3fc456694


[SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs

## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists 
(without IF NOT EXISTS)**
When we create OR alter a view, we check whether the view already exists. In 
the current implementation, if a table with the same name exists, we treat it 
as a view. However, this is not the right behavior. We should follow what Hive 
does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```

**Issue 2: Strange Error when Issuing Load Table Against A View**
Users should not be allowed to issue LOAD DATA against a view. Currently, when 
users do so, they get a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```

## How was this patch tested?
Added test cases

Author: gatorsmile 

Closes #14314 from gatorsmile/tableDDLAgainstView.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3fc45669
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3fc45669
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3fc45669

Branch: refs/heads/master
Commit: 3fc456694151e766c551b4bc58ed7c945666
Parents: e164a04
Author: gatorsmile 
Authored: Tue Jul 26 09:32:29 2016 +0800
Committer: Wenchen Fan 
Committed: Tue Jul 26 09:32:29 2016 +0800

--
 .../sql/catalyst/catalog/SessionCatalog.scala   |  6 +-
 .../spark/sql/execution/command/tables.scala| 33 -
 .../spark/sql/execution/command/views.scala |  5 ++
 .../spark/sql/hive/execution/SQLViewSuite.scala | 71 +++-
 4 files changed, 96 insertions(+), 19 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3fc45669/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
index 134fc4e..1856dc4 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
@@ -443,12 +443,12 @@ class SessionCatalog(
   }
 
   /**
-   * Return whether a table with the specified name exists.
+   * Return whether a table/view with the specified name exists.
*
-   * Note: If a database is explicitly specified, then this will return 
whether the table
+   * Note: If a database is explicitly specified, then this will return 
whether the table/view
* exists in that particular database instead. In that case, even if there 
is a temporary
* table with the same name, we will return false if the specified database 
does not
-   * contain the table.
+   * contain the table/view.
*/
   def tableExists(name: TableIdentifier): Boolean = synchronized {
 val db = formatDatabaseName(name.database.getOrElse(currentDb))

http://git-wip-us.apache.org/repos/asf/spark/blob/3fc45669/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
index 8f3adad..c6daa95 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala
@@ -202,35 +202,38 @@ case class LoadDataCommand(
   override def run(sparkSession: SparkSession): Seq[Row] = {
 val ca

spark git commit: Fix description of spark.speculation.quantile

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 9d581dc61 -> 3d3547487


Fix description of spark.speculation.quantile

## What changes were proposed in this pull request?

Minor doc fix regarding the spark.speculation.quantile configuration parameter. 
 It incorrectly states it should be a percentage, when it should be a fraction.
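
As a hedged illustration of the corrected wording (a spark-shell style snippet, names illustrative): the value is a fraction in (0, 1], so requiring 90% of tasks to finish is written as 0.9, not 90.

```scala
// Illustrative only: setting the quantile as a fraction on a SparkConf.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-example")          // app name is illustrative
  .set("spark.speculation", "true")           // enable speculative execution
  .set("spark.speculation.quantile", "0.9")   // a fraction (0.9), not a percentage (90)
```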

## How was this patch tested?

I tried building the documentation but got some unidoc errors.  I also got them 
when building off origin/master, so I don't think I caused that problem.  I did 
run the web app and saw the changes reflected as expected.

Author: Nicholas Brown 

Closes #14352 from nwbvt/master.

(cherry picked from commit ba0aade6d517364363e07ed09278c2b44110c33b)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3d354748
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3d354748
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3d354748

Branch: refs/heads/branch-2.0
Commit: 3d35474872d3b117abc3fc7debcb1eb6409769d6
Parents: 9d581dc
Author: Nicholas Brown 
Authored: Mon Jul 25 19:18:27 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 19:18:33 2016 -0700

--
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3d354748/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index 86a9bd9..bf10b24 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1174,7 +1174,7 @@ Apart from these, the following properties are also 
available, and may be useful
   spark.speculation.quantile
   0.75
   
-Percentage of tasks which must be complete before speculation is enabled 
for a particular stage.
+Fraction of tasks which must be complete before speculation is enabled for 
a particular stage.
   
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Fix description of spark.speculation.quantile

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 3fc456694 -> ba0aade6d


Fix description of spark.speculation.quantile

## What changes were proposed in this pull request?

Minor doc fix regarding the spark.speculation.quantile configuration parameter. 
 It incorrectly states it should be a percentage, when it should be a fraction.

## How was this patch tested?

I tried building the documentation but got some unidoc errors.  I also got them 
when building off origin/master, so I don't think I caused that problem.  I did 
run the web app and saw the changes reflected as expected.

Author: Nicholas Brown 

Closes #14352 from nwbvt/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba0aade6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba0aade6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba0aade6

Branch: refs/heads/master
Commit: ba0aade6d517364363e07ed09278c2b44110c33b
Parents: 3fc4566
Author: Nicholas Brown 
Authored: Mon Jul 25 19:18:27 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 19:18:27 2016 -0700

--
 docs/configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ba0aade6/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index 86a9bd9..bf10b24 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1174,7 +1174,7 @@ Apart from these, the following properties are also 
available, and may be useful
   spark.speculation.quantile
   0.75
   
-Percentage of tasks which must be complete before speculation is enabled 
for a particular stage.
+Fraction of tasks which must be complete before speculation is enabled for 
a particular stage.
   
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master ba0aade6d -> 8a8d26f1e


[SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries

## What changes were proposed in this pull request?

Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* 
`EXISTS` queries. We should prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```
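
A simplified, hedged sketch of the guard this PR adds (see the `SQLBuilder` diff below): the `Project`/`Filter` wrapper is only built when there are conditions to reduce over, so an unoptimized `EXISTS` with no conditions no longer triggers `empty.reduceLeft`.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, And, Expression, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

object ExistsGuardSketch {
  // Simplified from the SQLBuilder change below; names mirror the Catalyst ones.
  def existsChild(conditions: Seq[Expression], subquery: LogicalPlan): LogicalPlan =
    if (conditions.isEmpty) {
      subquery                                      // nothing to filter on: use the subquery as-is
    } else {
      Project(Seq(Alias(Literal(1), "1")()),
        Filter(conditions.reduce(And), subquery))   // previous behavior when conditions exist
    }
}
```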

## How was this patch tested?

Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun 

Closes #14307 from dongjoon-hyun/SPARK-16672.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a8d26f1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a8d26f1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a8d26f1

Branch: refs/heads/master
Commit: 8a8d26f1e27db5c2228307b1c3609b4713b9d0db
Parents: ba0aade
Author: Dongjoon Hyun 
Authored: Mon Jul 25 19:52:17 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 19:52:17 2016 -0700

--
 .../scala/org/apache/spark/sql/catalyst/SQLBuilder.scala  |  9 +++--
 sql/hive/src/test/resources/sqlgen/predicate_subquery.sql |  4 
 .../apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala | 10 ++
 3 files changed, 21 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8a8d26f1/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
index a8cc72f..9a02e3c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
@@ -512,8 +512,13 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging 
{
   ScalarSubquery(rewrite, Seq.empty, exprId)
 
 case PredicateSubquery(query, conditions, false, exprId) =>
-  val plan = Project(Seq(Alias(Literal(1), "1")()),
-Filter(conditions.reduce(And), addSubqueryIfNeeded(query)))
+  val subquery = addSubqueryIfNeeded(query)
+  val plan = if (conditions.isEmpty) {
+subquery
+  } else {
+Project(Seq(Alias(Literal(1), "1")()),
+  Filter(conditions.reduce(And), subquery))
+  }
   Exists(plan, exprId)
 
 case PredicateSubquery(query, conditions, true, exprId) =>

http://git-wip-us.apache.org/repos/asf/spark/blob/8a8d26f1/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql
--
diff --git a/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql 
b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql
new file mode 100644
index 000..2e06b4f
--- /dev/null
+++ b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql
@@ -0,0 +1,4 @@
+-- This file is automatically generated by LogicalPlanToSQLSuite.
+select * from t1 b where exists (select * from t1 a)
+
+SELECT `gen_attr` AS `a` FROM (SELECT `gen_attr` FROM (SELECT `a` AS 
`gen_attr` FROM `default`.`t1`) AS gen_subquery_0 WHERE EXISTS(SELECT 
`gen_attr` AS `a` FROM ((SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM 
`default`.`t1`) AS gen_subquery_0) AS gen_subquery_1) AS gen_subquery_1)) AS b

http://git-wip-us.apache.org/repos/asf/spark/blob/8a8d26f1/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
index 1f5078d..ebece38 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
@@ -25,6 +25,7 @@ import scala.util.control.NonFatal
 import org.apache.spark.sql.Column
 import org.apache.spark.sql.catalyst.parser.ParseException
 import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
 
 /**
@@ -927,6 +928,15 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
+  test("predicate subquery") {
+withTable("t1") {
+  withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
+sql("CREATE TABLE t1(a int)

spark git commit: [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 3d3547487 -> aeb6d5c05


[SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries

## What changes were proposed in this pull request?

Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* 
`EXISTS` queries. We should prevent this.
```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```

## How was this patch tested?

Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun 

Closes #14307 from dongjoon-hyun/SPARK-16672.

(cherry picked from commit 8a8d26f1e27db5c2228307b1c3609b4713b9d0db)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aeb6d5c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aeb6d5c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aeb6d5c0

Branch: refs/heads/branch-2.0
Commit: aeb6d5c053d4e848df0e7842a3994154df464647
Parents: 3d35474
Author: Dongjoon Hyun 
Authored: Mon Jul 25 19:52:17 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 19:52:22 2016 -0700

--
 .../scala/org/apache/spark/sql/catalyst/SQLBuilder.scala  |  9 +++--
 sql/hive/src/test/resources/sqlgen/predicate_subquery.sql |  4 
 .../apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala | 10 ++
 3 files changed, 21 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/aeb6d5c0/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
index a8cc72f..9a02e3c 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala
@@ -512,8 +512,13 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging 
{
   ScalarSubquery(rewrite, Seq.empty, exprId)
 
 case PredicateSubquery(query, conditions, false, exprId) =>
-  val plan = Project(Seq(Alias(Literal(1), "1")()),
-Filter(conditions.reduce(And), addSubqueryIfNeeded(query)))
+  val subquery = addSubqueryIfNeeded(query)
+  val plan = if (conditions.isEmpty) {
+subquery
+  } else {
+Project(Seq(Alias(Literal(1), "1")()),
+  Filter(conditions.reduce(And), subquery))
+  }
   Exists(plan, exprId)
 
 case PredicateSubquery(query, conditions, true, exprId) =>

http://git-wip-us.apache.org/repos/asf/spark/blob/aeb6d5c0/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql
--
diff --git a/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql 
b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql
new file mode 100644
index 000..2e06b4f
--- /dev/null
+++ b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql
@@ -0,0 +1,4 @@
+-- This file is automatically generated by LogicalPlanToSQLSuite.
+select * from t1 b where exists (select * from t1 a)
+
+SELECT `gen_attr` AS `a` FROM (SELECT `gen_attr` FROM (SELECT `a` AS 
`gen_attr` FROM `default`.`t1`) AS gen_subquery_0 WHERE EXISTS(SELECT 
`gen_attr` AS `a` FROM ((SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM 
`default`.`t1`) AS gen_subquery_0) AS gen_subquery_1) AS gen_subquery_1)) AS b

http://git-wip-us.apache.org/repos/asf/spark/blob/aeb6d5c0/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
--
diff --git 
a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
 
b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
index 1f5078d..ebece38 100644
--- 
a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
+++ 
b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala
@@ -25,6 +25,7 @@ import scala.util.control.NonFatal
 import org.apache.spark.sql.Column
 import org.apache.spark.sql.catalyst.parser.ParseException
 import org.apache.spark.sql.functions._
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SQLTestUtils
 
 /**
@@ -927,6 +928,15 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with 
SQLTestUtils {
 }
   }
 
+  test("predicate subquery") {
+withTable("t

spark git commit: [SPARK-16724] Expose DefinedByConstructorParams

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 aeb6d5c05 -> 4b38a6a53


[SPARK-16724] Expose DefinedByConstructorParams

We don't generally make things in catalyst/execution private.  Instead they are 
just undocumented due to their lack of stability guarantees.
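
A hedged sketch of what exposing the trait enables (the class below is hypothetical): a plain, non-case class can opt in to Catalyst's constructor-parameter-based reflection by mixing in `DefinedByConstructorParams`.

```scala
// Hypothetical user-side class; DefinedByConstructorParams lives in
// org.apache.spark.sql.catalyst (see ScalaReflection.scala in the diff below).
import org.apache.spark.sql.catalyst.DefinedByConstructorParams

// Not a case class, but its fields are entirely defined by constructor params,
// so Catalyst's reflection can treat it like one.
class Point(val x: Double, val y: Double) extends DefinedByConstructorParams
```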

Author: Michael Armbrust 

Closes #14356 from marmbrus/patch-1.

(cherry picked from commit f99e34e8e58c97ff30c6e054875533350d99fe5b)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4b38a6a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4b38a6a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4b38a6a5

Branch: refs/heads/branch-2.0
Commit: 4b38a6a534d93b1eab3b19f62a2f78474be1d8bc
Parents: aeb6d5c
Author: Michael Armbrust 
Authored: Mon Jul 25 20:41:24 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 20:41:29 2016 -0700

--
 .../main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4b38a6a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
index 78c145d..8affb03 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
@@ -30,7 +30,7 @@ import org.apache.spark.unsafe.types.{CalendarInterval, 
UTF8String}
  * for classes whose fields are entirely defined by constructor params but 
should not be
  * case classes.
  */
-private[sql] trait DefinedByConstructorParams
+trait DefinedByConstructorParams
 
 
 /**


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16724] Expose DefinedByConstructorParams

2016-07-25 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 8a8d26f1e -> f99e34e8e


[SPARK-16724] Expose DefinedByConstructorParams

We don't generally make things in catalyst/execution private.  Instead they are 
just undocumented due to their lack of stability guarantees.

Author: Michael Armbrust 

Closes #14356 from marmbrus/patch-1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f99e34e8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f99e34e8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f99e34e8

Branch: refs/heads/master
Commit: f99e34e8e58c97ff30c6e054875533350d99fe5b
Parents: 8a8d26f
Author: Michael Armbrust 
Authored: Mon Jul 25 20:41:24 2016 -0700
Committer: Reynold Xin 
Committed: Mon Jul 25 20:41:24 2016 -0700

--
 .../main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f99e34e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
index 78c145d..8affb03 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
@@ -30,7 +30,7 @@ import org.apache.spark.unsafe.types.{CalendarInterval, 
UTF8String}
  * for classes whose fields are entirely defined by constructor params but 
should not be
  * case classes.
  */
-private[sql] trait DefinedByConstructorParams
+trait DefinedByConstructorParams
 
 
 /**


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions

2016-07-25 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master f99e34e8e -> 815f3eece


[SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead 
and lag functions

## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, 
which is described below:
1. lead/lag respect null input values, which means that if the offset row 
exists and the input value is null, the result will be null instead of the 
default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of 
its input (because of the first change).
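
A hedged SQL sketch of the restored semantics, assuming a `SparkSession` named `spark` and a hypothetical table `t(id INT, v INT)` containing the rows (1, 10), (2, NULL), (3, 30); the expected results simply follow the three rules above.

```scala
// Illustrative only; not taken from the new tests in SQLWindowFunctionSuite.
spark.sql("SELECT id, LEAD(v, 1, -1) OVER (ORDER BY id) AS lead_v FROM t").show()
// id = 1 -> lead_v = NULL  (offset row exists but its value is NULL: the default is NOT used)
// id = 2 -> lead_v = 30    (offset row exists, its value is returned as-is)
// id = 3 -> lead_v = -1    (offset row does not exist, so the default -1 is used)
```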

Second, this PR fixes the evaluation of lead/lag when the input expression is a 
literal. This fix is a result of the first change. In current master, if a 
literal is used as the input expression of a lead or lag function, the result 
will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not 
resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai 

Closes #14284 from yhuai/lead-lag.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/815f3eec
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/815f3eec
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/815f3eec

Branch: refs/heads/master
Commit: 815f3eece5f095919a329af8cbd762b9ed71c7a8
Parents: f99e34e
Author: Yin Huai 
Authored: Mon Jul 25 20:58:07 2016 -0700
Committer: Yin Huai 
Committed: Mon Jul 25 20:58:07 2016 -0700

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  |   3 +-
 .../expressions/windowExpressions.scala |  45 +-
 .../apache/spark/sql/execution/WindowExec.scala |  34 +-
 .../sql/execution/SQLWindowFunctionSuite.scala  | 414 +++
 .../hive/execution/SQLWindowFunctionSuite.scala | 370 -
 5 files changed, 467 insertions(+), 399 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/815f3eec/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index d1d2c59..61162cc 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1787,7 +1787,8 @@ class Analyzer(
 s @ WindowSpecDefinition(_, o, UnspecifiedFrame))
   if wf.frame != UnspecifiedFrame =>
   WindowExpression(wf, s.copy(frameSpecification = wf.frame))
-case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
UnspecifiedFrame)) =>
+case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
UnspecifiedFrame))
+  if e.resolved =>
   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, 
acceptWindowFrame = true)
   we.copy(windowSpec = s.copy(frameSpecification = frame))
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/815f3eec/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index e35192c..6806591 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -321,8 +321,7 @@ abstract class OffsetWindowFunction
   val input: Expression
 
   /**
-   * Default result value for the function when the input expression returns 
NULL. The default will
-   * evaluated against the current row instead of the offset row.
+   * Default result value for the function when the 'offset'th row does not 
exist.
*/
   val default: Expression
 
@@ -348,7 +347,7 @@ abstract class OffsetWindowFunction
*/
   override def foldable: Boolean = false
 
-  override def nullable: Boolean = default == null || default.nullable
+  override def nullable: Boolean = default == null || default.nullable || 
input.nullable
 
   override lazy val frame = {
 // This will be triggered by the Analyzer.
@@ -373,20 +372,22 @@ abstract class OffsetWindowFunction
 }
 
 /**
- * The Lead function returns the value of 'x' at 'offset' rows after the 
current row in the windo

spark git commit: [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions

2016-07-25 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 4b38a6a53 -> 4391d4a3c


[SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead 
and lag functions

## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, 
which is described below:
1. lead/lag respect null input values, which means that if the offset row 
exists and the input value is null, the result will be null instead of the 
default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of 
its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a 
literal. This fix is a result of the first change. In current master, if a 
literal is used as the input expression of a lead or lag function, the result 
will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not 
resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite

Author: Yin Huai 

Closes #14284 from yhuai/lead-lag.

(cherry picked from commit 815f3eece5f095919a329af8cbd762b9ed71c7a8)
Signed-off-by: Yin Huai 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4391d4a3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4391d4a3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4391d4a3

Branch: refs/heads/branch-2.0
Commit: 4391d4a3c60d59df625cbfdb918aa67c51ebcbc1
Parents: 4b38a6a
Author: Yin Huai 
Authored: Mon Jul 25 20:58:07 2016 -0700
Committer: Yin Huai 
Committed: Mon Jul 25 20:58:57 2016 -0700

--
 .../spark/sql/catalyst/analysis/Analyzer.scala  |   3 +-
 .../expressions/windowExpressions.scala |  45 +-
 .../apache/spark/sql/execution/WindowExec.scala |  34 +-
 .../sql/execution/SQLWindowFunctionSuite.scala  | 414 +++
 .../hive/execution/SQLWindowFunctionSuite.scala | 370 -
 5 files changed, 467 insertions(+), 399 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4391d4a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index d1d2c59..61162cc 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -1787,7 +1787,8 @@ class Analyzer(
 s @ WindowSpecDefinition(_, o, UnspecifiedFrame))
   if wf.frame != UnspecifiedFrame =>
   WindowExpression(wf, s.copy(frameSpecification = wf.frame))
-case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
UnspecifiedFrame)) =>
+case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, 
UnspecifiedFrame))
+  if e.resolved =>
   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, 
acceptWindowFrame = true)
   we.copy(windowSpec = s.copy(frameSpecification = frame))
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/4391d4a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
index e35192c..6806591 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala
@@ -321,8 +321,7 @@ abstract class OffsetWindowFunction
   val input: Expression
 
   /**
-   * Default result value for the function when the input expression returns 
NULL. The default will
-   * evaluated against the current row instead of the offset row.
+   * Default result value for the function when the 'offset'th row does not 
exist.
*/
   val default: Expression
 
@@ -348,7 +347,7 @@ abstract class OffsetWindowFunction
*/
   override def foldable: Boolean = false
 
-  override def nullable: Boolean = default == null || default.nullable
+  override def nullable: Boolean = default == null || default.nullable || 
input.nullable
 
   override lazy val frame = {
 // This will be triggered by the Analyzer.
@@ -373,20 +372,22 @@ abstract class OffsetWindowFunction
 }
 
 

spark git commit: [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning

2016-07-25 Thread wenchen
Repository: spark
Updated Branches:
  refs/heads/master 815f3eece -> 7b06a8948


[SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by 
ColumnPruning

## What changes were proposed in this pull request?

We push `Project` down through `Sample` in the `Optimizer` via the rule 
`PushProjectThroughSample`. However, if the projected columns produce new 
output, they are evaluated against the whole data instead of the sampled data, 
which introduces an inconsistency between the original plan (Sample then 
Project) and the optimized plan (Project then Sample). In the extreme case 
attached to the JIRA, where the projected column is a UDF that is not supposed 
to see the sampled-out data, the UDF's result is incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown, we 
don't need `PushProjectThroughSample` anymore. `ColumnPruning` also avoids the 
issue described above (see the sketch below).
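
A hedged, self-contained sketch of the issue (all names and numbers are illustrative, not taken from the JIRA): if the `Project` carrying the UDF were pushed below the `Sample`, the UDF would be evaluated against the full data rather than the sampled subset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SamplePushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sample-pushdown").getOrCreate()
    import spark.implicits._

    // Count UDF invocations across tasks with an accumulator.
    val calls = spark.sparkContext.longAccumulator("udf calls")
    val tagged = udf { (x: Int) => calls.add(1); x }

    val df = (1 to 1000).toDF("a")

    // Written as Sample then Project. If the optimizer rewrote it to
    // Project then Sample, `tagged` would run ~1000 times instead of ~100.
    df.sample(withReplacement = false, fraction = 0.1, seed = 11L)
      .select(tagged($"a").as("a"))
      .collect()

    println(s"UDF invocations: ${calls.value}")   // expected to be roughly 100, not 1000
    spark.stop()
  }
}
```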

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh 

Closes #14327 from viirya/fix-sample-pushdown.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b06a894
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b06a894
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b06a894

Branch: refs/heads/master
Commit: 7b06a8948fc16d3c14e240fdd632b79ce1651008
Parents: 815f3ee
Author: Liang-Chi Hsieh 
Authored: Tue Jul 26 12:00:01 2016 +0800
Committer: Wenchen Fan 
Committed: Tue Jul 26 12:00:01 2016 +0800

--
 .../sql/catalyst/optimizer/Optimizer.scala  | 12 --
 .../catalyst/optimizer/ColumnPruningSuite.scala | 15 
 .../optimizer/FilterPushdownSuite.scala | 17 -
 .../org/apache/spark/sql/DatasetSuite.scala | 25 
 4 files changed, 40 insertions(+), 29 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7b06a894/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
index c8e9d8e..fe328fd 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
@@ -76,7 +76,6 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, 
conf: CatalystConf)
 Batch("Operator Optimizations", fixedPoint,
   // Operator push down
   PushThroughSetOperations,
-  PushProjectThroughSample,
   ReorderJoin,
   EliminateOuterJoin,
   PushPredicateThroughJoin,
@@ -149,17 +148,6 @@ class SimpleTestOptimizer extends Optimizer(
   new SimpleCatalystConf(caseSensitiveAnalysis = true))
 
 /**
- * Pushes projects down beneath Sample to enable column pruning with sampling.
- */
-object PushProjectThroughSample extends Rule[LogicalPlan] {
-  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
-// Push down projection into sample
-case Project(projectList, Sample(lb, up, replace, seed, child)) =>
-  Sample(lb, up, replace, seed, Project(projectList, child))()
-  }
-}
-
-/**
  * Removes the Project only conducting Alias of its child node.
  * It is created mainly for removing extra Project added in 
EliminateSerialization rule,
  * but can also benefit other operators.

http://git-wip-us.apache.org/repos/asf/spark/blob/7b06a894/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
index b5664a5..589607e 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala
@@ -346,5 +346,20 @@ class ColumnPruningSuite extends PlanTest {
 comparePlans(Optimize.execute(plan1.analyze), correctAnswer1)
   }
 
+  test("push project down into sample") {
+val testRelation = LocalRelation('a.int, 'b.int, 'c.int)
+val x = testRelation.subquery('x)
+
+val query1 = Sample(0.0, 0.6, false, 11L, x)().select('a)
+val optimized1 = Optimize.execute(query1.analyze)
+val expected1 = Sample(0.0, 0.6, false, 11L, x.select('a))()
+comparePlans(optimized1, expected1.analyze)
+
+val query2 = Sample(0.0, 0.6, false, 11L, x)().select('a as 'aa)
+val optimized2 = Optimize.execute(query2.analyze)
+val expected2 = Sample(0.0, 0.6, f

[spark] Git Push Summary

2016-07-25 Thread rxin
Repository: spark
Updated Tags:  refs/tags/v2.0.0 [created] 13650fc58

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] Git Push Summary

2016-07-25 Thread rxin
Repository: spark
Updated Tags:  refs/tags/v2.0.0-rc5 [deleted] 13650fc58

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] Git Push Summary

2016-07-25 Thread rxin
Repository: spark
Updated Tags:  refs/tags/v2.0.0-rc4 [deleted] e5f8c1117

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] Git Push Summary

2016-07-25 Thread rxin
Repository: spark
Updated Tags:  refs/tags/v2.0.0-rc2 [deleted] 4a55b2326

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] Git Push Summary

2016-07-25 Thread rxin
Repository: spark
Updated Tags:  refs/tags/v2.0.0-rc3 [deleted] 48d1fa3e7

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] Git Push Summary

2016-07-25 Thread rxin
Repository: spark
Updated Tags:  refs/tags/v2.0.0-rc1 [deleted] 0c66ca41a

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org