spark git commit: [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last
Repository: spark Updated Branches: refs/heads/master 468a3c3ac -> 68b4020d0 [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last ## What changes were proposed in this pull request? Default `TreeNode.withNewChildren` implementation doesn't work for `Last` and when both constructor arguments are the same, e.g.: ```sql LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE LAST_VALUE(FALSE, FALSE) LAST_VALUE(TRUE, TRUE) ``` This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way. This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`. ## How was this patch tested? New test case added in `WindowQuerySuite`. Author: Cheng Lian Closes #14295 from liancheng/spark-16648-last-value. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/68b4020d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/68b4020d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/68b4020d Branch: refs/heads/master Commit: 68b4020d0c0d4f063facfbf4639ef4251dcfda8b Parents: 468a3c3 Author: Cheng Lian Authored: Mon Jul 25 17:22:29 2016 +0800 Committer: Wenchen Fan Committed: Mon Jul 25 17:22:29 2016 +0800 -- .../sql/catalyst/expressions/aggregate/First.scala | 4 ++-- .../spark/sql/catalyst/expressions/aggregate/Last.scala | 4 ++-- .../spark/sql/hive/execution/WindowQuerySuite.scala | 12 3 files changed, 16 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/68b4020d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala index 946b3d4..d702c08 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala @@ -43,7 +43,7 @@ case class First(child: Expression, ignoreNullsExpr: Expression) extends Declara throw new AnalysisException("The second argument of First should be a boolean literal.") } - override def children: Seq[Expression] = child :: Nil + override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil override def nullable: Boolean = true @@ -54,7 +54,7 @@ case class First(child: Expression, ignoreNullsExpr: Expression) extends Declara override def dataType: DataType = child.dataType // Expected input data type. 
- override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, BooleanType) private lazy val first = AttributeReference("first", child.dataType)() http://git-wip-us.apache.org/repos/asf/spark/blob/68b4020d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala index 53b4b76..af88403 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala @@ -40,7 +40,7 @@ case class Last(child: Expression, ignoreNullsExpr: Expression) extends Declarat throw new AnalysisException("The second argument of First should be a boolean literal.") } - override def children: Seq[Expression] = child :: Nil + override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil override def nullable: Boolean = true @@ -51,7 +51,7 @@ case class Last(child: Expression, ignoreNullsExpr: Expression) extends Declarat override def dataType: DataType = child.dataType // Expected input data type. - override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, BooleanType) private lazy val last = AttributeReference("last", child.dataType)() http://git-wip-us.apache.org/repos/asf/spark/blob/68b4020d/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/WindowQuerySuite.scala --
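The failure mode is easy to reproduce in miniature: with `LAST_VALUE(FALSE, FALSE)` both constructor arguments hold the same literal, so any copy routine that substitutes "arguments equal to an old child" cannot tell `child` and `ignoreNullsExpr` apart unless both are reported as children. The sketch below is a hypothetical, simplified Scala illustration (not Spark's actual `TreeNode`, `Literal`, or `Last` classes) of why declaring both arguments as children keeps the argument-to-child mapping one-to-one.

```scala
// Hypothetical miniature of the problem; Lit and Last stand in for Spark's
// Literal and Last expressions, they are not the real classes.
case class Lit(value: Any)

case class Last(child: Lit, ignoreNullsExpr: Lit) {
  // Before the patch only `child` was listed here; afterwards both arguments are,
  // so the number of declared children matches the number of Expression-typed
  // constructor arguments that a withNewChildren-style copy has to rewrite.
  def children: Seq[Lit] = child :: ignoreNullsExpr :: Nil
}

object Repro extends App {
  // LAST_VALUE(FALSE, FALSE): both constructor arguments are the same value.
  val last = Last(Lit(false), Lit(false))
  println(last.child == last.ignoreNullsExpr) // true -> equality-based matching is ambiguous
  println(last.children.length)               // 2 -> one slot per constructor argument
}
```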
spark git commit: [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last
Repository: spark Updated Branches: refs/heads/branch-2.0 d226dce12 -> fcbb7f653 [SPARK-16648][SQL] Make ignoreNullsExpr a child expression of First and Last ## What changes were proposed in this pull request? Default `TreeNode.withNewChildren` implementation doesn't work for `Last` and when both constructor arguments are the same, e.g.: ```sql LAST_VALUE(FALSE) -- The 2nd argument defaults to FALSE LAST_VALUE(FALSE, FALSE) LAST_VALUE(TRUE, TRUE) ``` This is because although `Last` is a unary expression, both of its constructor arguments, `child` and `ignoreNullsExpr`, are `Expression`s. When they have the same value, `TreeNode.withNewChildren` treats both of them as child nodes by mistake. `First` is also affected by this issue in exactly the same way. This PR fixes this issue by making `ignoreNullsExpr` a child expression of `First` and `Last`. ## How was this patch tested? New test case added in `WindowQuerySuite`. Author: Cheng Lian Closes #14295 from liancheng/spark-16648-last-value. (cherry picked from commit 68b4020d0c0d4f063facfbf4639ef4251dcfda8b) Signed-off-by: Wenchen Fan Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fcbb7f65 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fcbb7f65 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fcbb7f65 Branch: refs/heads/branch-2.0 Commit: fcbb7f653df11d923a208c5af03c0a6b9a472376 Parents: d226dce Author: Cheng Lian Authored: Mon Jul 25 17:22:29 2016 +0800 Committer: Wenchen Fan Committed: Mon Jul 25 17:25:19 2016 +0800 -- .../sql/catalyst/expressions/aggregate/First.scala | 4 ++-- .../spark/sql/catalyst/expressions/aggregate/Last.scala | 4 ++-- .../spark/sql/hive/execution/WindowQuerySuite.scala | 12 3 files changed, 16 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fcbb7f65/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala index 946b3d4..d702c08 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala @@ -43,7 +43,7 @@ case class First(child: Expression, ignoreNullsExpr: Expression) extends Declara throw new AnalysisException("The second argument of First should be a boolean literal.") } - override def children: Seq[Expression] = child :: Nil + override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil override def nullable: Boolean = true @@ -54,7 +54,7 @@ case class First(child: Expression, ignoreNullsExpr: Expression) extends Declara override def dataType: DataType = child.dataType // Expected input data type. 
- override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, BooleanType) private lazy val first = AttributeReference("first", child.dataType)() http://git-wip-us.apache.org/repos/asf/spark/blob/fcbb7f65/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala index 53b4b76..af88403 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Last.scala @@ -40,7 +40,7 @@ case class Last(child: Expression, ignoreNullsExpr: Expression) extends Declarat throw new AnalysisException("The second argument of First should be a boolean literal.") } - override def children: Seq[Expression] = child :: Nil + override def children: Seq[Expression] = child :: ignoreNullsExpr :: Nil override def nullable: Boolean = true @@ -51,7 +51,7 @@ case class Last(child: Expression, ignoreNullsExpr: Expression) extends Declarat override def dataType: DataType = child.dataType // Expected input data type. - override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType) + override def inputTypes: Seq[AbstractDataType] = Seq(AnyDataType, BooleanType) private lazy val last = AttributeReference("last", child.dataType)() http://git-wip-us.apache.org/repos/asf/spark/blob/fcbb7f65/sql/hive/src/test/scala/org/apache/
spark git commit: [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading
Repository: spark Updated Branches: refs/heads/master 68b4020d0 -> 7ffd99ec5 [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading ## What changes were proposed in this pull request? Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values. It might not have to be done like this because the schema is already kept in `JDBCRDD`. So, appropriate converters can be created first according to the schema, and then apply them to each row. ## How was this patch tested? Existing tests should cover this. Author: hyukjinkwon Closes #14313 from HyukjinKwon/SPARK-16674. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ffd99ec Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ffd99ec Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ffd99ec Branch: refs/heads/master Commit: 7ffd99ec5f267730734431097cbb700ad074bebe Parents: 68b4020 Author: hyukjinkwon Authored: Mon Jul 25 19:57:47 2016 +0800 Committer: Wenchen Fan Committed: Mon Jul 25 19:57:47 2016 +0800 -- .../execution/datasources/jdbc/JDBCRDD.scala| 245 ++- 1 file changed, 129 insertions(+), 116 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7ffd99ec/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala index 24e2c1a..4c98430 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala @@ -28,7 +28,7 @@ import org.apache.spark.{Partition, SparkContext, TaskContext} import org.apache.spark.internal.Logging import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions.SpecificMutableRow +import org.apache.spark.sql.catalyst.expressions.{MutableRow, SpecificMutableRow} import org.apache.spark.sql.catalyst.util.{DateTimeUtils, GenericArrayData} import org.apache.spark.sql.jdbc.JdbcDialects import org.apache.spark.sql.sources._ @@ -322,43 +322,134 @@ private[sql] class JDBCRDD( } } - // Each JDBC-to-Catalyst conversion corresponds to a tag defined here so that - // we don't have to potentially poke around in the Metadata once for every - // row. - // Is there a better way to do this? I'd rather be using a type that - // contains only the tags I define. - abstract class JDBCConversion - case object BooleanConversion extends JDBCConversion - case object DateConversion extends JDBCConversion - case class DecimalConversion(precision: Int, scale: Int) extends JDBCConversion - case object DoubleConversion extends JDBCConversion - case object FloatConversion extends JDBCConversion - case object IntegerConversion extends JDBCConversion - case object LongConversion extends JDBCConversion - case object BinaryLongConversion extends JDBCConversion - case object StringConversion extends JDBCConversion - case object TimestampConversion extends JDBCConversion - case object BinaryConversion extends JDBCConversion - case class ArrayConversion(elementConversion: JDBCConversion) extends JDBCConversion + // A `JDBCValueSetter` is responsible for converting and setting a value from `ResultSet` + // into a field for `MutableRow`. 
The last argument `Int` means the index for the + // value to be set in the row and also used for the value to retrieve from `ResultSet`. + private type JDBCValueSetter = (ResultSet, MutableRow, Int) => Unit /** - * Maps a StructType to a type tag list. + * Creates `JDBCValueSetter`s according to [[StructType]], which can set + * each value from `ResultSet` to each field of [[MutableRow]] correctly. */ - def getConversions(schema: StructType): Array[JDBCConversion] = -schema.fields.map(sf => getConversions(sf.dataType, sf.metadata)) - - private def getConversions(dt: DataType, metadata: Metadata): JDBCConversion = dt match { -case BooleanType => BooleanConversion -case DateType => DateConversion -case DecimalType.Fixed(p, s) => DecimalConversion(p, s) -case DoubleType => DoubleConversion -case FloatType => FloatConversion -case IntegerType => IntegerConversion -case LongType => if (metadata.contains("binarylong")) BinaryLongConversion else LongConversion -case StringType => StringConversion -case TimestampType => TimestampConversion -case BinaryType => BinaryConversion -case ArrayType(et, _) => ArrayConversion(getConversions(et, metadata)) +
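The shape of the change is worth spelling out: rather than pattern-matching on a type tag for every value of every row, the schema is walked once up front to build an array of setter closures, and the per-row loop only invokes them. Below is a minimal, self-contained sketch of that pattern using plain JDBC and an `Array[Any]` row; it is not Spark's code, and the column descriptors are illustrative stand-ins for the Catalyst schema.

```scala
import java.sql.ResultSet

object SetterSketch {
  // Illustrative column descriptors; Spark derives the equivalent from the schema.
  sealed trait ColType
  case object IntCol extends ColType
  case object StringCol extends ColType
  case object DoubleCol extends ColType

  // Mirrors the shape of the JDBCValueSetter alias: (resultSet, row, columnIndex) => Unit.
  type ValueSetter = (ResultSet, Array[Any], Int) => Unit

  // Type dispatch happens once per column, not once per value.
  def makeSetter(ct: ColType): ValueSetter = ct match {
    case IntCol    => (rs, row, i) => row(i) = rs.getInt(i + 1)    // JDBC indices are 1-based
    case StringCol => (rs, row, i) => row(i) = rs.getString(i + 1)
    case DoubleCol => (rs, row, i) => row(i) = rs.getDouble(i + 1)
  }

  // The hot loop only calls precomputed closures.
  def readRow(rs: ResultSet, setters: Array[ValueSetter]): Array[Any] = {
    val row = new Array[Any](setters.length)
    var i = 0
    while (i < setters.length) {
      setters(i)(rs, row, i)
      i += 1
    }
    row
  }

  def readAll(rs: ResultSet, schema: Seq[ColType]): Vector[Array[Any]] = {
    val setters = schema.map(makeSetter).toArray
    val rows = Vector.newBuilder[Array[Any]]
    while (rs.next()) rows += readRow(rs, setters)
    rows.result()
  }
}
```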
spark git commit: [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable
Repository: spark Updated Branches: refs/heads/master 7ffd99ec5 -> d27d362eb [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable ## What changes were proposed in this pull request? `CreateViewCommand` only needs some information of a `CatalogTable`, but not all of them. We have some tricks(e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`. This PR cleans it up and only pass in necessary information to `CreateViewCommand`. ## How was this patch tested? existing tests. Author: Wenchen Fan Closes #14297 from cloud-fan/minor2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d27d362e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d27d362e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d27d362e Branch: refs/heads/master Commit: d27d362ebae0c4a5cc6c99f13ef20049214dd4f9 Parents: 7ffd99e Author: Wenchen Fan Authored: Mon Jul 25 22:02:00 2016 +0800 Committer: Cheng Lian Committed: Mon Jul 25 22:02:00 2016 +0800 -- .../spark/sql/catalyst/catalog/interface.scala | 6 +- .../scala/org/apache/spark/sql/Dataset.scala| 27 ++--- .../spark/sql/execution/SparkSqlParser.scala| 51 - .../spark/sql/execution/command/views.scala | 111 ++- .../spark/sql/hive/HiveMetastoreCatalog.scala | 2 - .../spark/sql/hive/HiveDDLCommandSuite.scala| 46 +++- 6 files changed, 116 insertions(+), 127 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d27d362e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index b7f35b3..2a20651 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -81,9 +81,9 @@ object CatalogStorageFormat { */ case class CatalogColumn( name: String, -// This may be null when used to create views. TODO: make this type-safe; this is left -// as a string due to issues in converting Hive varchars to and from SparkSQL strings. -@Nullable dataType: String, +// TODO: make this type-safe; this is left as a string due to issues in converting Hive +// varchars to and from SparkSQL strings. 
+dataType: String, nullable: Boolean = true, comment: Option[String] = None) { http://git-wip-us.apache.org/repos/asf/spark/blob/d27d362e/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index b28ecb7..8b6443c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -2421,13 +2421,7 @@ class Dataset[T] private[sql]( */ @throws[AnalysisException] def createTempView(viewName: String): Unit = withPlan { -val tableDesc = CatalogTable( - identifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName), - tableType = CatalogTableType.VIEW, - schema = Seq.empty[CatalogColumn], - storage = CatalogStorageFormat.empty) -CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = false, - isTemporary = true) +createViewCommand(viewName, replace = false) } /** @@ -2438,12 +2432,19 @@ class Dataset[T] private[sql]( * @since 2.0.0 */ def createOrReplaceTempView(viewName: String): Unit = withPlan { -val tableDesc = CatalogTable( - identifier = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName), - tableType = CatalogTableType.VIEW, - schema = Seq.empty[CatalogColumn], - storage = CatalogStorageFormat.empty) -CreateViewCommand(tableDesc, logicalPlan, allowExisting = false, replace = true, +createViewCommand(viewName, replace = true) + } + + private def createViewCommand(viewName: String, replace: Boolean): CreateViewCommand = { +CreateViewCommand( + name = sparkSession.sessionState.sqlParser.parseTableIdentifier(viewName), + userSpecifiedColumns = Nil, + comment = None, + properties = Map.empty, + originalText = None, + child = logicalPlan, + allowExisting = false, + replace = replace, isTemporary = true) } http://git-wip-us.apache.org/repos/asf/spark/blob/d27d362e/sql/core/src/main/scala/org/apache
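The refactoring is internal, but the user-facing entry points it simplifies are `Dataset.createTempView` and `createOrReplaceTempView`, which now funnel into one private helper that passes only the fields `CreateViewCommand` actually needs. A short usage reminder, assuming a running `SparkSession` named `spark`:

```scala
val df = spark.range(10).toDF("id")

df.createTempView("tmp_ids")            // fails with AnalysisException if the view exists
df.createOrReplaceTempView("tmp_ids")   // replaces an existing view silently

spark.sql("SELECT COUNT(*) AS n FROM tmp_ids").show()
```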
spark git commit: [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable
Repository: spark Updated Branches: refs/heads/master d27d362eb -> 64529b186 [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable ## What changes were proposed in this pull request? It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into catalyst module. ## How was this patch tested? existing tests. Author: Wenchen Fan Closes #14331 from cloud-fan/check. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/64529b18 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/64529b18 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/64529b18 Branch: refs/heads/master Commit: 64529b186a1c33740067cc7639d630bc5b9ae6e8 Parents: d27d362 Author: Wenchen Fan Authored: Mon Jul 25 22:05:48 2016 +0800 Committer: Cheng Lian Committed: Mon Jul 25 22:05:48 2016 +0800 -- .../spark/sql/catalyst/catalog/interface.scala | 49 .../catalyst/catalog/ExternalCatalogSuite.scala | 2 +- .../org/apache/spark/sql/DataFrameWriter.scala | 5 +- .../spark/sql/execution/command/ddl.scala | 3 +- .../spark/sql/execution/command/tables.scala| 30 +- .../execution/datasources/BucketingUtils.scala | 39 + .../sql/execution/datasources/DataSource.scala | 1 + .../datasources/FileSourceStrategy.scala| 1 + .../InsertIntoHadoopFsRelationCommand.scala | 2 +- .../execution/datasources/WriterContainer.scala | 1 + .../sql/execution/datasources/bucket.scala | 59 .../spark/sql/execution/datasources/ddl.scala | 1 + .../datasources/fileSourceInterfaces.scala | 2 +- .../apache/spark/sql/internal/CatalogImpl.scala | 2 +- .../sql/execution/command/DDLCommandSuite.scala | 6 +- .../spark/sql/execution/command/DDLSuite.scala | 3 +- .../datasources/FileSourceStrategySuite.scala | 1 + .../spark/sql/internal/CatalogSuite.scala | 5 +- .../sql/sources/CreateTableAsSelectSuite.scala | 2 +- .../spark/sql/hive/client/HiveClientImpl.scala | 9 +-- .../spark/sql/hive/HiveDDLCommandSuite.scala| 8 +-- .../spark/sql/sources/BucketedReadSuite.scala | 3 +- 22 files changed, 117 insertions(+), 117 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/64529b18/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index 2a20651..710bce5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -25,6 +25,7 @@ import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier} import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference} import org.apache.spark.sql.catalyst.parser.CatalystSqlParser import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan} +import org.apache.spark.sql.catalyst.util.quoteIdentifier /** @@ -110,6 +111,24 @@ case class CatalogTablePartition( /** + * A container for bucketing information. + * Bucketing is a technology for decomposing data sets into more manageable parts, and the number + * of buckets is fixed so it does not fluctuate with data. + * + * @param numBuckets number of buckets. + * @param bucketColumnNames the names of the columns that used to generate the bucket id. + * @param sortColumnNames the names of the columns that used to sort data in each bucket. 
+ */ +case class BucketSpec( +numBuckets: Int, +bucketColumnNames: Seq[String], +sortColumnNames: Seq[String]) { + if (numBuckets <= 0) { +throw new AnalysisException(s"Expected positive number of buckets, but got `$numBuckets`.") + } +} + +/** * A table defined in the catalog. * * Note that Hive's metastore also tracks skewed columns. We should consider adding that in the @@ -124,9 +143,7 @@ case class CatalogTable( storage: CatalogStorageFormat, schema: Seq[CatalogColumn], partitionColumnNames: Seq[String] = Seq.empty, -sortColumnNames: Seq[String] = Seq.empty, -bucketColumnNames: Seq[String] = Seq.empty, -numBuckets: Int = -1, +bucketSpec: Option[BucketSpec] = None, owner: String = "", createTime: Long = System.currentTimeMillis, lastAccessTime: Long = -1, @@ -143,8 +160,8 @@ case class CatalogTable( s"must be a subset of schema (${colNames.mkString(", ")}) in table '$identifier'") } requireSubsetOfSchema(partitionColumnNames, "partition") - requireSubsetOfSch
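With `BucketSpec` now living in the catalyst module and `CatalogTable` carrying an `Option[BucketSpec]` instead of three loose fields, the writer-side API that produces bucketed tables is unchanged. A hedged usage sketch follows, assuming a `SparkSession` named `spark` whose catalog can persist tables; the table and column names are illustrative.

```scala
import spark.implicits._

val sales = Seq((1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0))
  .toDF("id", "country", "amount")

// Internally the writer describes this request as BucketSpec(8, Seq("id"), Seq("id"));
// BucketSpec now rejects non-positive bucket counts at construction time.
sales.write
  .bucketBy(8, "id")
  .sortBy("id")
  .saveAsTable("sales_bucketed")   // bucketing is only supported for saveAsTable
```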
spark git commit: [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages
Repository: spark Updated Branches: refs/heads/master 64529b186 -> d6a52176a [SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages ## What changes were proposed in this pull request? This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page size the generated parquet file spans across 3 pages (within a single row group) where the first page is dictionary encoded and the remaining two are plain encoded. ## How was this patch tested? 1. ParquetEncodingSuite 2. Also manually tested that this test fails without https://github.com/apache/spark/pull/12279 Author: Sameer Agarwal Closes #14304 from sameeragarwal/hybrid-encoding-test. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d6a52176 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d6a52176 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d6a52176 Branch: refs/heads/master Commit: d6a52176ade92853f37167ad27631977dc79bc76 Parents: 64529b1 Author: Sameer Agarwal Authored: Mon Jul 25 22:31:01 2016 +0800 Committer: Cheng Lian Committed: Mon Jul 25 22:31:01 2016 +0800 -- .../parquet/ParquetEncodingSuite.scala | 29 1 file changed, 29 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d6a52176/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala index 88fcfce..c754188 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala @@ -16,6 +16,10 @@ */ package org.apache.spark.sql.execution.datasources.parquet +import scala.collection.JavaConverters._ + +import org.apache.parquet.hadoop.ParquetOutputFormat + import org.apache.spark.sql.test.SharedSQLContext // TODO: this needs a lot more testing but it's currently not easy to test with the parquet @@ -78,4 +82,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContex }} } } + + test("Read row group containing both dictionary and plain encoded pages") { +withSQLConf(ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "2048", + ParquetOutputFormat.PAGE_SIZE -> "4096") { + withTempPath { dir => +// In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size +// such that the following data spans across 3 pages (within a single row group) where the +// first page is dictionary encoded and the remaining two are plain encoded. 
+val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString)) +data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath) +val file = SpecificParquetRecordReaderBase.listDirectory(dir).asScala.head + +val reader = new VectorizedParquetRecordReader +reader.initialize(file, null /* set columns to null to project all columns */) +val column = reader.resultBatch().column(0) +assert(reader.nextBatch()) + +(0 until 512).foreach { i => + assert(column.getUTF8String(3 * i).toString == i.toString) + assert(column.getUTF8String(3 * i + 1).toString == i.toString) + assert(column.getUTF8String(3 * i + 2).toString == i.toString) +} + } +} + } }
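Outside the test harness, the same page-size knobs can be used to force a row group whose first page is dictionary encoded and whose later pages fall back to plain encoding once the dictionary page overflows. A hedged sketch, assuming a running `SparkSession` named `spark` and relying on the same conf-propagation mechanism the test uses; the output path is illustrative.

```scala
import org.apache.parquet.hadoop.ParquetOutputFormat
import spark.implicits._

// SQL conf entries are forwarded into the Hadoop conf of the write job, so the
// Parquet writer sees these page-size limits (the test sets them via withSQLConf).
spark.conf.set(ParquetOutputFormat.DICTIONARY_PAGE_SIZE, "2048")
spark.conf.set(ParquetOutputFormat.PAGE_SIZE, "4096")

// 512 distinct strings, each repeated 3 times: small enough to start a dictionary,
// large enough to spill past it into plain-encoded pages within one row group.
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
data.toDF("f").coalesce(1).write.parquet("/tmp/hybrid-encoding-demo")
```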
spark git commit: [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat
Repository: spark Updated Branches: refs/heads/master d6a52176a -> 79826f3c7 [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat ## What changes were proposed in this pull request? It seems this is a regression assuming from https://issues.apache.org/jira/browse/SPARK-16698. Field name having dots throws an exception. For example the codes below: ```scala val path = "/tmp/path" val json =""" {"a.b":"data"}""" spark.sparkContext .parallelize(json :: Nil) .saveAsTextFile(path) spark.read.json(path).collect() ``` throws an exception as below: ``` Unable to resolve a.b given [a.b]; org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b]; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) at scala.Option.getOrElse(Option.scala:121) ``` This problem was introduced in https://github.com/apache/spark/commit/17eec0a71ba8713c559d641e3f43a1be726b037c#diff-27c76f96a7b2733ecfd6f46a1716e153R121 When extracting the data columns, it does not count that it can contains dots in field names. Actually, it seems the fields name are not expected as quoted when defining schema. So, It not have to consider whether this is wrapped with quotes because the actual schema (inferred or user-given schema) would not have the quotes for fields. For example, this throws an exception. (**Loading JSON from RDD is fine**) ```scala val json =""" {"a.b":"data"}""" val rdd = spark.sparkContext.parallelize(json :: Nil) spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true .json(rdd).select("`a.b`").printSchema() ``` as below: ``` cannot resolve '```a.b```' given input columns: [`a.b`]; org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` ## How was this patch tested? Unit tests in `FileSourceStrategySuite`. Author: hyukjinkwon Closes #14339 from HyukjinKwon/SPARK-16698-regression. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79826f3c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79826f3c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79826f3c Branch: refs/heads/master Commit: 79826f3c7936ee27457d030c7115d5cac69befd7 Parents: d6a5217 Author: hyukjinkwon Authored: Mon Jul 25 22:51:30 2016 +0800 Committer: Cheng Lian Committed: Mon Jul 25 22:51:30 2016 +0800 -- .../sql/catalyst/plans/logical/LogicalPlan.scala | 2 +- .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 15 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/79826f3c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala index d0b2b5d..6d77991 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala @@ -127,7 +127,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging { */ def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = { schema.map { field => - resolveQuoted(field.name, resolver).map { + resolve(field.name :: Nil, resolver).map { case a: AttributeReference => a case other => sys.error(s"can not handle nested schema yet... plan $this") }.getOrElse { http://git-wip-us.apache.org/repos/asf/spark/blob/79826f3c/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index aa80d61..06cc2a5 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -2982,4 +2982,19 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { """.stripMargin), Nil) } } + + test("SPARK-16674: field names containing dots for both fields and partitioned fields") { +withTempPath { path => + val data = (1 to 10).map(i => (i, s"data-$i", i % 2,
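After the fix, a top-level column whose name contains a dot resolves as long as it is quoted with backticks, which is what the regression test exercises. A minimal reproduction, assuming a `SparkSession` named `spark`; the path is illustrative and the write route differs slightly from the RDD-based one in the PR description.

```scala
import spark.implicits._

val path = "/tmp/dotted-field-demo"
Seq("""{"a.b": "data"}""").toDS.write.text(path)   // one JSON record per line

val df = spark.read.json(path)   // inferred schema: root |-- a.b: string
df.select("`a.b`").show()        // backticks keep the dot from being parsed as nesting
```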
spark git commit: [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat
Repository: spark Updated Branches: refs/heads/branch-2.0 fcbb7f653 -> b52e639a8 [SPARK-16698][SQL] Field names having dots should be allowed for datasources based on FileFormat ## What changes were proposed in this pull request? It seems this is a regression assuming from https://issues.apache.org/jira/browse/SPARK-16698. Field name having dots throws an exception. For example the codes below: ```scala val path = "/tmp/path" val json =""" {"a.b":"data"}""" spark.sparkContext .parallelize(json :: Nil) .saveAsTextFile(path) spark.read.json(path).collect() ``` throws an exception as below: ``` Unable to resolve a.b given [a.b]; org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b]; at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134) at scala.Option.getOrElse(Option.scala:121) ``` This problem was introduced in https://github.com/apache/spark/commit/17eec0a71ba8713c559d641e3f43a1be726b037c#diff-27c76f96a7b2733ecfd6f46a1716e153R121 When extracting the data columns, it does not count that it can contains dots in field names. Actually, it seems the fields name are not expected as quoted when defining schema. So, It not have to consider whether this is wrapped with quotes because the actual schema (inferred or user-given schema) would not have the quotes for fields. For example, this throws an exception. (**Loading JSON from RDD is fine**) ```scala val json =""" {"a.b":"data"}""" val rdd = spark.sparkContext.parallelize(json :: Nil) spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true .json(rdd).select("`a.b`").printSchema() ``` as below: ``` cannot resolve '```a.b```' given input columns: [`a.b`]; org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`]; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) ``` ## How was this patch tested? Unit tests in `FileSourceStrategySuite`. Author: hyukjinkwon Closes #14339 from HyukjinKwon/SPARK-16698-regression. 
(cherry picked from commit 79826f3c7936ee27457d030c7115d5cac69befd7) Signed-off-by: Cheng Lian Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b52e639a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b52e639a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b52e639a Branch: refs/heads/branch-2.0 Commit: b52e639a84a851e0b9159a0f6dae92664425042e Parents: fcbb7f6 Author: hyukjinkwon Authored: Mon Jul 25 22:51:30 2016 +0800 Committer: Cheng Lian Committed: Mon Jul 25 22:51:56 2016 +0800 -- .../sql/catalyst/plans/logical/LogicalPlan.scala | 2 +- .../scala/org/apache/spark/sql/SQLQuerySuite.scala | 15 +++ 2 files changed, 16 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b52e639a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala index d0b2b5d..6d77991 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala @@ -127,7 +127,7 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging { */ def resolve(schema: StructType, resolver: Resolver): Seq[Attribute] = { schema.map { field => - resolveQuoted(field.name, resolver).map { + resolve(field.name :: Nil, resolver).map { case a: AttributeReference => a case other => sys.error(s"can not handle nested schema yet... plan $this") }.getOrElse { http://git-wip-us.apache.org/repos/asf/spark/blob/b52e639a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index f1a2410..be84dff 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -2946,4 +2946,19 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { """.stripMargin), Nil) } } + + test("SPARK-16674: field names containing dots for both fields and partit
spark git commit: [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions
Repository: spark Updated Branches: refs/heads/master 79826f3c7 -> 7ea6d282b [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions ## What changes were proposed in this pull request? This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present. Before: ```sql ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ``` After: ```sql (ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ``` ## How was this patch tested? New test case added in `ExpressionSQLBuilderSuite`. Author: Cheng Lian Closes #14334 from liancheng/window-spec-sql-format. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ea6d282 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ea6d282 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ea6d282 Branch: refs/heads/master Commit: 7ea6d282b925819ddb3874a67b3c9da8cc41f131 Parents: 79826f3 Author: Cheng Lian Authored: Mon Jul 25 09:42:39 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 09:42:39 2016 -0700 -- .../expressions/windowExpressions.scala | 6 ++-- .../sqlgen/aggregate_functions_and_window.sql | 2 +- .../sqlgen/regular_expressions_and_window.sql | 2 +- .../test/resources/sqlgen/window_basic_1.sql| 2 +- .../test/resources/sqlgen/window_basic_2.sql| 2 +- .../catalyst/ExpressionSQLBuilderSuite.scala| 35 ++-- 6 files changed, 40 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7ea6d282/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index c0b453d..e35192c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -82,16 +82,16 @@ case class WindowSpecDefinition( val partition = if (partitionSpec.isEmpty) { "" } else { - "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ") + "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ") + " " } val order = if (orderSpec.isEmpty) { "" } else { - "ORDER BY " + orderSpec.map(_.sql).mkString(", ") + "ORDER BY " + orderSpec.map(_.sql).mkString(", ") + " " } -s"($partition $order ${frameSpecification.toString})" +s"($partition$order${frameSpecification.toString})" } } http://git-wip-us.apache.org/repos/asf/spark/blob/7ea6d282/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql -- diff --git a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql index 3a29bcf..c94f53b 100644 --- a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql +++ b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql @@ -1,4 +1,4 @@ -- This file is automatically generated by LogicalPlanToSQLSuite. 
SELECT MAX(c) + COUNT(a) OVER () FROM parquet_t2 GROUP BY a, b -SELECT `gen_attr` AS `(max(c) + count(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS `gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, count(`gen_attr`) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS `gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, `gen_attr`) AS gen_subquery_1) AS gen_subquery_2) AS gen_subquery_3 +SELECT `gen_attr` AS `(max(c) + count(a) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS `gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, count(`gen_attr`) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS `gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, `gen_attr`) AS gen_subquery_1) AS gen_subquery_2) AS gen_subquery_3 http://git-wip-us.apache.org/repos/asf/s
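The formatting rule the patch settles on is simple: each optional clause supplies its own trailing space when present, so concatenation never produces a doubled space or a space right after the opening parenthesis. A standalone, REPL-style sketch of that string-building choice (not the actual `WindowSpecDefinition` code):

```scala
// Each non-empty clause carries its own trailing separator, so
// "(" + partition + order + frame + ")" is well-formed with or without PARTITION BY.
def windowSpecSql(partitionSpec: Seq[String], orderSpec: Seq[String], frame: String): String = {
  val partition =
    if (partitionSpec.isEmpty) "" else partitionSpec.mkString("PARTITION BY ", ", ", " ")
  val order =
    if (orderSpec.isEmpty) "" else orderSpec.mkString("ORDER BY ", ", ", " ")
  s"($partition$order$frame)"
}

windowSpecSql(Nil, Seq("`a` ASC"), "ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")
// "(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)"

windowSpecSql(Seq("`k`"), Nil, "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING")
// "(PARTITION BY `k` ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)"
```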
spark git commit: [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions
Repository: spark Updated Branches: refs/heads/branch-2.0 b52e639a8 -> 57d65e511 [SPARK-16703][SQL] Remove extra whitespace in SQL generation for window functions ## What changes were proposed in this pull request? This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present. Before: ```sql ( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ``` After: ```sql (ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ``` ## How was this patch tested? New test case added in `ExpressionSQLBuilderSuite`. Author: Cheng Lian Closes #14334 from liancheng/window-spec-sql-format. (cherry picked from commit 7ea6d282b925819ddb3874a67b3c9da8cc41f131) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/57d65e51 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/57d65e51 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/57d65e51 Branch: refs/heads/branch-2.0 Commit: 57d65e5111e281d3d5224c5ea11005c89718f791 Parents: b52e639 Author: Cheng Lian Authored: Mon Jul 25 09:42:39 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 09:42:47 2016 -0700 -- .../expressions/windowExpressions.scala | 6 ++-- .../sqlgen/aggregate_functions_and_window.sql | 2 +- .../sqlgen/regular_expressions_and_window.sql | 2 +- .../test/resources/sqlgen/window_basic_1.sql| 2 +- .../test/resources/sqlgen/window_basic_2.sql| 2 +- .../catalyst/ExpressionSQLBuilderSuite.scala| 35 ++-- 6 files changed, 40 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/57d65e51/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index c0b453d..e35192c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -82,16 +82,16 @@ case class WindowSpecDefinition( val partition = if (partitionSpec.isEmpty) { "" } else { - "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ") + "PARTITION BY " + partitionSpec.map(_.sql).mkString(", ") + " " } val order = if (orderSpec.isEmpty) { "" } else { - "ORDER BY " + orderSpec.map(_.sql).mkString(", ") + "ORDER BY " + orderSpec.map(_.sql).mkString(", ") + " " } -s"($partition $order ${frameSpecification.toString})" +s"($partition$order${frameSpecification.toString})" } } http://git-wip-us.apache.org/repos/asf/spark/blob/57d65e51/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql -- diff --git a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql index 3a29bcf..c94f53b 100644 --- a/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql +++ b/sql/hive/src/test/resources/sqlgen/aggregate_functions_and_window.sql @@ -1,4 +1,4 @@ -- This file is automatically generated by LogicalPlanToSQLSuite. 
SELECT MAX(c) + COUNT(a) OVER () FROM parquet_t2 GROUP BY a, b -SELECT `gen_attr` AS `(max(c) + count(a) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS `gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, count(`gen_attr`) OVER ( ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS `gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, `gen_attr`) AS gen_subquery_1) AS gen_subquery_2) AS gen_subquery_3 +SELECT `gen_attr` AS `(max(c) + count(a) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING))` FROM (SELECT (`gen_attr` + `gen_attr`) AS `gen_attr` FROM (SELECT gen_subquery_1.`gen_attr`, gen_subquery_1.`gen_attr`, count(`gen_attr`) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS `gen_attr` FROM (SELECT max(`gen_attr`) AS `gen_attr`, `gen_attr` FROM (SELECT `a` AS `gen_attr`, `b` AS `gen_attr`, `c` AS `gen_attr`, `d` AS `gen_attr` FROM `default`.`parquet_t2`) AS gen_subquery_0 GROUP BY `gen_attr`, `ge
spark git commit: [SPARKR][DOCS] fix broken url in doc
Repository: spark Updated Branches: refs/heads/master 7ea6d282b -> b73defdd7 [SPARKR][DOCS] fix broken url in doc ## What changes were proposed in this pull request? Fix broken url, also, sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop" ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png) Data type section is in the middle of a list of gapply/gapplyCollect subsections: ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png) ## How was this patch tested? manual test Author: Felix Cheung Closes #14329 from felixcheung/rdoclinkfix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b73defdd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b73defdd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b73defdd Branch: refs/heads/master Commit: b73defdd790cb823a4f9958ca89cec06fd198051 Parents: 7ea6d28 Author: Felix Cheung Authored: Mon Jul 25 11:25:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jul 25 11:25:41 2016 -0700 -- R/pkg/R/DataFrame.R | 2 +- R/pkg/R/sparkR.R| 16 +++ docs/sparkr.md | 107 +++ 3 files changed, 62 insertions(+), 63 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b73defdd/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 2e99aa0..a473331 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -35,7 +35,7 @@ setOldClass("structType") #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame #' @seealso \link{createDataFrame}, \link{read.json}, \link{table} -#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe} +#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} #' @export #' @examples #'\dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/b73defdd/R/pkg/R/sparkR.R -- diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R index ff5297f..524f7c4 100644 --- a/R/pkg/R/sparkR.R +++ b/R/pkg/R/sparkR.R @@ -28,14 +28,6 @@ connExists <- function(env) { }) } -#' @rdname sparkR.session.stop -#' @name sparkR.stop -#' @export -#' @note sparkR.stop since 1.4.0 -sparkR.stop <- function() { - sparkR.session.stop() -} - #' Stop the Spark Session and Spark Context #' #' Stop the Spark Session and Spark Context. @@ -90,6 +82,14 @@ sparkR.session.stop <- function() { clearJobjs() } +#' @rdname sparkR.session.stop +#' @name sparkR.stop +#' @export +#' @note sparkR.stop since 1.4.0 +sparkR.stop <- function() { + sparkR.session.stop() +} + #' (Deprecated) Initialize a new Spark Context #' #' This function initializes a new SparkContext. http://git-wip-us.apache.org/repos/asf/spark/blob/b73defdd/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index dfa5278..4bbc362 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -322,8 +322,59 @@ head(ldf, 3) Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to that key. The groups are chosen from `SparkDataFrame`s column(s). The output of function should be a `data.frame`. Schema specifies the row format of the resulting -`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. 
The column names of the returned `data.frame` are set by user. Below is the data type mapping between R -and Spark. +`SparkDataFrame`. It must represent R function's output schema on the basis of Spark [data types](#data-type-mapping-between-r-and-spark). The column names of the returned `data.frame` are set by user. + + +{% highlight r %} + +# Determine six waiting times with the largest eruption time in minutes. +schema <- structType(structField("waiting", "double"), structField("max_eruption", "double")) +result <- gapply( +df, +"waiting", +function(key, x) { +y <- data.frame(key, max(x$eruptions)) +}, +schema) +head(collect(arrange(result, "max_eruption", decreasing = TRUE))) + +##waiting max_eruption +##1 64 5.100 +##2 69 5.067 +##3 71 5.033 +##4 87 5.000 +##5 63 4.933 +##6 89 4.900 +{% endhighlight %} + + +# gapplyCollect +Like `gapply`, applies a function to each partition of
spark git commit: [SPARKR][DOCS] fix broken url in doc
Repository: spark Updated Branches: refs/heads/branch-2.0 57d65e511 -> d9bd066b9 [SPARKR][DOCS] fix broken url in doc ## What changes were proposed in this pull request? Fix broken url, also, sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop" ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png) Data type section is in the middle of a list of gapply/gapplyCollect subsections: ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png) ## How was this patch tested? manual test Author: Felix Cheung Closes #14329 from felixcheung/rdoclinkfix. (cherry picked from commit b73defdd790cb823a4f9958ca89cec06fd198051) Signed-off-by: Shivaram Venkataraman Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d9bd066b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d9bd066b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d9bd066b Branch: refs/heads/branch-2.0 Commit: d9bd066b9f37cfd18037b9a600371d0342703c0f Parents: 57d65e5 Author: Felix Cheung Authored: Mon Jul 25 11:25:41 2016 -0700 Committer: Shivaram Venkataraman Committed: Mon Jul 25 11:25:51 2016 -0700 -- R/pkg/R/DataFrame.R | 2 +- R/pkg/R/sparkR.R| 16 +++ docs/sparkr.md | 107 +++ 3 files changed, 62 insertions(+), 63 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d9bd066b/R/pkg/R/DataFrame.R -- diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R index 92c10f1..aa211b3 100644 --- a/R/pkg/R/DataFrame.R +++ b/R/pkg/R/DataFrame.R @@ -35,7 +35,7 @@ setOldClass("structType") #' @slot env An R environment that stores bookkeeping states of the SparkDataFrame #' @slot sdf A Java object reference to the backing Scala DataFrame #' @seealso \link{createDataFrame}, \link{read.json}, \link{table} -#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe} +#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes} #' @export #' @examples #'\dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/d9bd066b/R/pkg/R/sparkR.R -- diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R index ff5297f..524f7c4 100644 --- a/R/pkg/R/sparkR.R +++ b/R/pkg/R/sparkR.R @@ -28,14 +28,6 @@ connExists <- function(env) { }) } -#' @rdname sparkR.session.stop -#' @name sparkR.stop -#' @export -#' @note sparkR.stop since 1.4.0 -sparkR.stop <- function() { - sparkR.session.stop() -} - #' Stop the Spark Session and Spark Context #' #' Stop the Spark Session and Spark Context. @@ -90,6 +82,14 @@ sparkR.session.stop <- function() { clearJobjs() } +#' @rdname sparkR.session.stop +#' @name sparkR.stop +#' @export +#' @note sparkR.stop since 1.4.0 +sparkR.stop <- function() { + sparkR.session.stop() +} + #' (Deprecated) Initialize a new Spark Context #' #' This function initializes a new SparkContext. http://git-wip-us.apache.org/repos/asf/spark/blob/d9bd066b/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index dfa5278..4bbc362 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -322,8 +322,59 @@ head(ldf, 3) Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to that key. The groups are chosen from `SparkDataFrame`s column(s). The output of function should be a `data.frame`. 
Schema specifies the row format of the resulting -`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below is the data type mapping between R -and Spark. +`SparkDataFrame`. It must represent R function's output schema on the basis of Spark [data types](#data-type-mapping-between-r-and-spark). The column names of the returned `data.frame` are set by user. + + +{% highlight r %} + +# Determine six waiting times with the largest eruption time in minutes. +schema <- structType(structField("waiting", "double"), structField("max_eruption", "double")) +result <- gapply( +df, +"waiting", +function(key, x) { +y <- data.frame(key, max(x$eruptions)) +}, +schema) +head(collect(arrange(result, "max_eruption", decreasing = TRUE))) + +##waiting max_eruption +##1 64 5.100 +##2 69 5.067 +##3 71 5.033 +##4 87 5.000 +##5 63 4.933 +##6
spark git commit: [SPARK-16685] Remove audit-release scripts.
Repository: spark Updated Branches: refs/heads/master ad3708e78 -> dd784a882 [SPARK-16685] Remove audit-release scripts. ## What changes were proposed in this pull request? This patch removes dev/audit-release. It was initially created to do basic release auditing. They have been unused by for the last one year+. ## How was this patch tested? N/A Author: Reynold Xin Closes #14342 from rxin/SPARK-16685. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd784a88 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd784a88 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd784a88 Branch: refs/heads/master Commit: dd784a8822497ad0631208d56325c4d74ab9e036 Parents: ad3708e Author: Reynold Xin Authored: Mon Jul 25 20:03:54 2016 +0100 Committer: Sean Owen Committed: Mon Jul 25 20:03:54 2016 +0100 -- dev/audit-release/.gitignore| 2 - dev/audit-release/README.md | 12 - dev/audit-release/audit_release.py | 236 --- dev/audit-release/blank_maven_build/pom.xml | 43 dev/audit-release/blank_sbt_build/build.sbt | 30 --- dev/audit-release/maven_app_core/input.txt | 8 - dev/audit-release/maven_app_core/pom.xml| 52 .../maven_app_core/src/main/java/SimpleApp.java | 42 dev/audit-release/sbt_app_core/build.sbt| 28 --- dev/audit-release/sbt_app_core/input.txt| 8 - .../sbt_app_core/src/main/scala/SparkApp.scala | 63 - dev/audit-release/sbt_app_ganglia/build.sbt | 30 --- .../src/main/scala/SparkApp.scala | 41 dev/audit-release/sbt_app_graphx/build.sbt | 28 --- .../src/main/scala/GraphxApp.scala | 55 - dev/audit-release/sbt_app_hive/build.sbt| 29 --- dev/audit-release/sbt_app_hive/data.txt | 9 - .../sbt_app_hive/src/main/scala/HiveApp.scala | 59 - dev/audit-release/sbt_app_kinesis/build.sbt | 28 --- .../src/main/scala/SparkApp.scala | 35 --- dev/audit-release/sbt_app_sql/build.sbt | 28 --- .../sbt_app_sql/src/main/scala/SqlApp.scala | 61 - dev/audit-release/sbt_app_streaming/build.sbt | 28 --- .../src/main/scala/StreamingApp.scala | 65 - 24 files changed, 1020 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dd784a88/dev/audit-release/.gitignore -- diff --git a/dev/audit-release/.gitignore b/dev/audit-release/.gitignore deleted file mode 100644 index 7e057a9..000 --- a/dev/audit-release/.gitignore +++ /dev/null @@ -1,2 +0,0 @@ -project/ -spark_audit* http://git-wip-us.apache.org/repos/asf/spark/blob/dd784a88/dev/audit-release/README.md -- diff --git a/dev/audit-release/README.md b/dev/audit-release/README.md deleted file mode 100644 index 37b2a0a..000 --- a/dev/audit-release/README.md +++ /dev/null @@ -1,12 +0,0 @@ -Test Application Builds -=== - -This directory includes test applications which are built when auditing releases. You can run them locally by setting appropriate environment variables. - -``` -$ cd sbt_app_core -$ SCALA_VERSION=2.11.7 \ - SPARK_VERSION=1.0.0-SNAPSHOT \ - SPARK_RELEASE_REPOSITORY=file:///home/patrick/.ivy2/local \ - sbt run -``` http://git-wip-us.apache.org/repos/asf/spark/blob/dd784a88/dev/audit-release/audit_release.py -- diff --git a/dev/audit-release/audit_release.py b/dev/audit-release/audit_release.py deleted file mode 100755 index b28e7a4..000 --- a/dev/audit-release/audit_release.py +++ /dev/null @@ -1,236 +0,0 @@ -#!/usr/bin/python - -# -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. 
-# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -#http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# - -# Audits binary and maven artifacts for a Spark release. -# Requires GPG and Maven. -# usage: -# python audit_release.py - -import os -import re -import shutil -import subprocess -import sys -import time
spark git commit: [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6
Repository: spark Updated Branches: refs/heads/master b73defdd7 -> ad3708e78 [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6 ## What changes were proposed in this pull request? replace ANN convergence tolerance param default from 1e-4 to 1e-6 so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer. ## How was this patch tested? Existing Test. Author: WeichenXu Closes #14286 from WeichenXu123/update_ann_tol. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad3708e7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad3708e7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad3708e7 Branch: refs/heads/master Commit: ad3708e78377d631e3d586548c961f4748322bf0 Parents: b73defd Author: WeichenXu Authored: Mon Jul 25 20:00:37 2016 +0100 Committer: Sean Owen Committed: Mon Jul 25 20:00:37 2016 +0100 -- .../ml/classification/MultilayerPerceptronClassifier.scala | 4 ++-- python/pyspark/ml/classification.py | 8 2 files changed, 6 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ad3708e7/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index 76ef32a..7264a99 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -100,7 +100,7 @@ private[classification] trait MultilayerPerceptronParams extends PredictorParams @Since("2.0.0") final def getInitialWeights: Vector = $(initialWeights) - setDefault(maxIter -> 100, tol -> 1e-4, blockSize -> 128, + setDefault(maxIter -> 100, tol -> 1e-6, blockSize -> 128, solver -> MultilayerPerceptronClassifier.LBFGS, stepSize -> 0.03) } @@ -190,7 +190,7 @@ class MultilayerPerceptronClassifier @Since("1.5.0") ( /** * Set the convergence tolerance of iterations. * Smaller value will lead to higher accuracy with the cost of more iterations. - * Default is 1E-4. + * Default is 1E-6. 
* * @group setParam */ http://git-wip-us.apache.org/repos/asf/spark/blob/ad3708e7/python/pyspark/ml/classification.py -- diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 613bc8c..9a3c7b1 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -1124,11 +1124,11 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, @keyword_only def __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", - maxIter=100, tol=1e-4, seed=None, layers=None, blockSize=128, stepSize=0.03, + maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, solver="l-bfgs", initialWeights=None): """ __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ - maxIter=100, tol=1e-4, seed=None, layers=None, blockSize=128, stepSize=0.03, \ + maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ solver="l-bfgs", initialWeights=None) """ super(MultilayerPerceptronClassifier, self).__init__() @@ -1141,11 +1141,11 @@ class MultilayerPerceptronClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, @keyword_only @since("1.6.0") def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", - maxIter=100, tol=1e-4, seed=None, layers=None, blockSize=128, stepSize=0.03, + maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, solver="l-bfgs", initialWeights=None): """ setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \ - maxIter=100, tol=1e-4, seed=None, layers=None, blockSize=128, stepSize=0.03, \ + maxIter=100, tol=1e-6, seed=None, layers=None, blockSize=128, stepSize=0.03, \ solver="l-bfgs", initialWeights=None) Sets params for MultilayerPerceptronClassifier. """ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additio
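Readers who need identical training behaviour across Spark versions can pin the tolerance explicitly instead of relying on the default this patch changes. A minimal sketch, assuming a training DataFrame named `training` with the usual `features` and `label` columns (both names are illustrative):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// Pin the convergence tolerance rather than depending on the version-specific default.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3)) // example topology: 4 inputs, one hidden layer of 5, 3 classes
  .setTol(1e-6)              // the new default; use 1e-4 to reproduce the old behaviour
  .setMaxIter(100)
  .setSeed(1234L)

// val model = trainer.fit(training) // `training` is a hypothetical DataFrame
```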
spark git commit: [SPARK-15271][MESOS] Allow force pulling executor docker images
Repository: spark Updated Branches: refs/heads/master dd784a882 -> 978cd5f12 [SPARK-15271][MESOS] Allow force pulling executor docker images ## What changes were proposed in this pull request? Mesos agents by default will not pull docker images which are cached locally already. In order to run Spark executors from mutable tags like `:latest` this commit introduces a Spark setting `spark.mesos.executor.docker.forcePullImage`. Setting this flag to true will tell the Mesos agent to force pull the docker image (default is `false` which is consistent with the previous implementation and Mesos' default behaviour). ## How was this patch tested? I ran a sample application including this change on a Mesos cluster and verified the correct behaviour for both, with and without, force pulling the executor image. As expected the image is being force pulled if the flag is set. Author: Philipp Hoffmann Closes #13051 from philipphoffmann/force-pull-image. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/978cd5f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/978cd5f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/978cd5f1 Branch: refs/heads/master Commit: 978cd5f125eb5a410bad2e60bf8385b11cf1b978 Parents: dd784a8 Author: Philipp Hoffmann Authored: Mon Jul 25 20:14:47 2016 +0100 Committer: Sean Owen Committed: Mon Jul 25 20:14:47 2016 +0100 -- .../cluster/mesos/MesosClusterScheduler.scala | 14 ++--- .../MesosCoarseGrainedSchedulerBackend.scala| 7 ++- .../MesosFineGrainedSchedulerBackend.scala | 7 ++- .../mesos/MesosSchedulerBackendUtil.scala | 20 --- ...esosCoarseGrainedSchedulerBackendSuite.scala | 63 .../MesosFineGrainedSchedulerBackendSuite.scala | 2 + dev/deps/spark-deps-hadoop-2.2 | 2 +- dev/deps/spark-deps-hadoop-2.3 | 2 +- dev/deps/spark-deps-hadoop-2.4 | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- docs/_config.yml| 2 +- docs/running-on-mesos.md| 12 pom.xml | 2 +- 14 files changed, 110 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/978cd5f1/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala index 39b0f4d..1e9644d 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala @@ -537,16 +537,10 @@ private[spark] class MesosClusterScheduler( .addAllResources(memResourcesToUse.asJava) offer.resources = finalResources.asJava submission.schedulerProperties.get("spark.mesos.executor.docker.image").foreach { image => - val container = taskInfo.getContainerBuilder() - val volumes = submission.schedulerProperties -.get("spark.mesos.executor.docker.volumes") -.map(MesosSchedulerBackendUtil.parseVolumesSpec) - val portmaps = submission.schedulerProperties -.get("spark.mesos.executor.docker.portmaps") -.map(MesosSchedulerBackendUtil.parsePortMappingsSpec) - MesosSchedulerBackendUtil.addDockerInfo( -container, image, volumes = volumes, portmaps = portmaps) - taskInfo.setContainer(container.build()) + MesosSchedulerBackendUtil.setupContainerBuilderDockerInfo( +image, +submission.schedulerProperties.get, +taskInfo.getContainerBuilder()) } val queuedTasks = tasks.getOrElseUpdate(offer.offerId, new 
ArrayBuffer[TaskInfo]) queuedTasks += taskInfo.build() http://git-wip-us.apache.org/repos/asf/spark/blob/978cd5f1/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala index 99e6d39..52993ca 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend
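The flag introduced by this patch is an ordinary Spark configuration key, so it can be set on the command line or programmatically. A sketch under the assumption of a Mesos deployment; the master URL and image name below are purely illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask the Mesos agent to re-pull the executor image even when a copy is cached locally,
// which is the behaviour you want for a mutable tag such as ":latest".
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")                                  // illustrative URL
  .setAppName("force-pull-example")
  .set("spark.mesos.executor.docker.image", "example/spark-executor:latest") // illustrative image
  .set("spark.mesos.executor.docker.forcePullImage", "true")                 // default: "false"
val sc = new SparkContext(conf)
```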
spark git commit: [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc
Repository: spark Updated Branches: refs/heads/master 978cd5f12 -> 3b6e1d094 [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc ## What changes were proposed in this pull request? Fixed several inline formatting in ml features doc. Before: https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png";> After: https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png";> ## How was this patch tested? Genetate the docs locally by `SKIP_API=1 jekyll build` and view it in the browser. Author: Shuai Lin Closes #14194 from lins05/fix-docs-formatting. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b6e1d09 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b6e1d09 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b6e1d09 Branch: refs/heads/master Commit: 3b6e1d094e153599e158331b10d33d74a667be5a Parents: 978cd5f Author: Shuai Lin Authored: Mon Jul 25 20:26:55 2016 +0100 Committer: Sean Owen Committed: Mon Jul 25 20:26:55 2016 +0100 -- docs/ml-features.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3b6e1d09/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index e7d7ddf..6020114 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -216,7 +216,7 @@ for more details on the API. [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more advanced tokenization based on regular expression (regex) matching. - By default, the parameter "pattern" (regex, default: \\s+) is used as delimiters to split the input text. + By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as delimiters to split the input text. Alternatively, users can set parameter "gaps" to false indicating the regex "pattern" denotes "tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result. @@ -815,7 +815,7 @@ The rescaled value for a feature E is calculated as, `\begin{equation} Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min \end{equation}` -For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)` +For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$` Note that since zero values will probably be transformed to non-zero values, output of the transformer will be `DenseVector` even for sparse input. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
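The second hunk touches the min-max rescaling formula, so it may help to see the rule spelled out as code. A small self-contained sketch of that formula; the function name and sample values are made up for illustration:

```scala
// Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min,
// with the degenerate case E_max == E_min mapping every value to 0.5 * (max + min).
def rescale(e: Double, eMin: Double, eMax: Double, min: Double, max: Double): Double =
  if (eMax == eMin) 0.5 * (max + min)
  else (e - eMin) / (eMax - eMin) * (max - min) + min

rescale(2.0, 0.0, 4.0, 0.0, 1.0) // 0.5: halfway through the observed range
rescale(3.0, 3.0, 3.0, 0.0, 1.0) // 0.5: a constant feature falls back to the midpoint
```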
spark git commit: [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc
Repository: spark Updated Branches: refs/heads/branch-2.0 d9bd066b9 -> f0d05f669 [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc ## What changes were proposed in this pull request? Fixed several inline formatting in ml features doc. Before: https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png";> After: https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png";> ## How was this patch tested? Genetate the docs locally by `SKIP_API=1 jekyll build` and view it in the browser. Author: Shuai Lin Closes #14194 from lins05/fix-docs-formatting. (cherry picked from commit 3b6e1d094e153599e158331b10d33d74a667be5a) Signed-off-by: Sean Owen Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0d05f66 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0d05f66 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0d05f66 Branch: refs/heads/branch-2.0 Commit: f0d05f669b4e7be017d8d0cfba33c3a61a1eef8f Parents: d9bd066 Author: Shuai Lin Authored: Mon Jul 25 20:26:55 2016 +0100 Committer: Sean Owen Committed: Mon Jul 25 20:27:04 2016 +0100 -- docs/ml-features.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f0d05f66/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index e7d7ddf..6020114 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -216,7 +216,7 @@ for more details on the API. [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more advanced tokenization based on regular expression (regex) matching. - By default, the parameter "pattern" (regex, default: \\s+) is used as delimiters to split the input text. + By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as delimiters to split the input text. Alternatively, users can set parameter "gaps" to false indicating the regex "pattern" denotes "tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result. @@ -815,7 +815,7 @@ The rescaled value for a feature E is calculated as, `\begin{equation} Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min \end{equation}` -For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)` +For the case `$E_{max} == E_{min}$`, `$Rescaled(e_i) = 0.5 * (max + min)$` Note that since zero values will probably be transformed to non-zero values, output of the transformer will be `DenseVector` even for sparse input. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"
Repository: spark Updated Branches: refs/heads/master 3b6e1d094 -> fc17121d5 Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images" This reverts commit 978cd5f125eb5a410bad2e60bf8385b11cf1b978. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc17121d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc17121d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc17121d Branch: refs/heads/master Commit: fc17121d592acbd7405135cd576bafc5c574650e Parents: 3b6e1d0 Author: Josh Rosen Authored: Mon Jul 25 12:43:44 2016 -0700 Committer: Josh Rosen Committed: Mon Jul 25 12:43:44 2016 -0700 -- .../cluster/mesos/MesosClusterScheduler.scala | 14 +++-- .../MesosCoarseGrainedSchedulerBackend.scala| 7 +-- .../MesosFineGrainedSchedulerBackend.scala | 7 +-- .../mesos/MesosSchedulerBackendUtil.scala | 20 +++ ...esosCoarseGrainedSchedulerBackendSuite.scala | 63 .../MesosFineGrainedSchedulerBackendSuite.scala | 2 - dev/deps/spark-deps-hadoop-2.2 | 2 +- dev/deps/spark-deps-hadoop-2.3 | 2 +- dev/deps/spark-deps-hadoop-2.4 | 2 +- dev/deps/spark-deps-hadoop-2.6 | 2 +- dev/deps/spark-deps-hadoop-2.7 | 2 +- docs/_config.yml| 2 +- docs/running-on-mesos.md| 12 pom.xml | 2 +- 14 files changed, 29 insertions(+), 110 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fc17121d/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala index 1e9644d..39b0f4d 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala @@ -537,10 +537,16 @@ private[spark] class MesosClusterScheduler( .addAllResources(memResourcesToUse.asJava) offer.resources = finalResources.asJava submission.schedulerProperties.get("spark.mesos.executor.docker.image").foreach { image => - MesosSchedulerBackendUtil.setupContainerBuilderDockerInfo( -image, -submission.schedulerProperties.get, -taskInfo.getContainerBuilder()) + val container = taskInfo.getContainerBuilder() + val volumes = submission.schedulerProperties +.get("spark.mesos.executor.docker.volumes") +.map(MesosSchedulerBackendUtil.parseVolumesSpec) + val portmaps = submission.schedulerProperties +.get("spark.mesos.executor.docker.portmaps") +.map(MesosSchedulerBackendUtil.parsePortMappingsSpec) + MesosSchedulerBackendUtil.addDockerInfo( +container, image, volumes = volumes, portmaps = portmaps) + taskInfo.setContainer(container.build()) } val queuedTasks = tasks.getOrElseUpdate(offer.offerId, new ArrayBuffer[TaskInfo]) queuedTasks += taskInfo.build() http://git-wip-us.apache.org/repos/asf/spark/blob/fc17121d/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala index 52993ca..99e6d39 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala @@ -408,11 +408,8 @@ private[spark] class 
MesosCoarseGrainedSchedulerBackend( .addAllResources(memResourcesToUse.asJava) sc.conf.getOption("spark.mesos.executor.docker.image").foreach { image => -MesosSchedulerBackendUtil.setupContainerBuilderDockerInfo( - image, - sc.conf.getOption, - taskBuilder.getContainerBuilder -) +MesosSchedulerBackendUtil + .setupContainerBuilderDockerInfo(image, sc.conf, taskBuilder.getContainerBuilder) } tasks(offer.getId) ::= taskBuilder.build() http://git-wip-us.apache.org/repos/asf/spark/blob/fc17121d/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosFine
spark git commit: [SQL][DOC] Fix a default name for parquet compression
Repository: spark Updated Branches: refs/heads/master fc17121d5 -> cda4603de [SQL][DOC] Fix a default name for parquet compression ## What changes were proposed in this pull request? This pr is to fix a wrong description for parquet default compression. Author: Takeshi YAMAMURO Closes #14351 from maropu/FixParquetDoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cda4603d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cda4603d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cda4603d Branch: refs/heads/master Commit: cda4603de340d533c49feac1b244ddfd291f9bcf Parents: fc17121 Author: Takeshi YAMAMURO Authored: Mon Jul 25 15:08:58 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 15:08:58 2016 -0700 -- docs/sql-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cda4603d/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index ad123d7..d8c8698 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -749,7 +749,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession spark.sql.parquet.compression.codec - gzip + snappy Sets the compression codec use when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
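For readers who would rather not depend on the default at all, the codec can be chosen per session. A sketch, assuming an existing `SparkSession` named `spark` and an illustrative output path:

```scala
// Override the session-wide Parquet codec; per the same doc page the accepted values
// are "uncompressed", "snappy", "gzip" and "lzo".
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
spark.range(1000).write.mode("overwrite").parquet("/tmp/parquet-codec-example")
```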
spark git commit: [SQL][DOC] Fix a default name for parquet compression
Repository: spark Updated Branches: refs/heads/branch-2.0 f0d05f669 -> 1b4f7cf13 [SQL][DOC] Fix a default name for parquet compression ## What changes were proposed in this pull request? This pr is to fix a wrong description for parquet default compression. Author: Takeshi YAMAMURO Closes #14351 from maropu/FixParquetDoc. (cherry picked from commit cda4603de340d533c49feac1b244ddfd291f9bcf) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b4f7cf1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b4f7cf1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b4f7cf1 Branch: refs/heads/branch-2.0 Commit: 1b4f7cf135eebc46f07649509a027b6d422dcfdf Parents: f0d05f6 Author: Takeshi YAMAMURO Authored: Mon Jul 25 15:08:58 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 15:09:04 2016 -0700 -- docs/sql-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1b4f7cf1/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index e92596b..33b170e 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -749,7 +749,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession spark.sql.parquet.compression.codec - gzip + snappy Sets the compression codec use when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16166][CORE] Also take off-heap memory usage into consideration in log and webui display
Repository: spark Updated Branches: refs/heads/master cda4603de -> f5ea7fe53 [SPARK-16166][CORE] Also take off-heap memory usage into consideration in log and webui display ## What changes were proposed in this pull request? Currently in the log and UI display, only on-heap storage memory is calculated and displayed, ``` 16/06/27 13:41:52 INFO MemoryStore: Block rdd_5_0 stored as values in memory (estimated size 17.8 KB, free 665.9 MB) ``` https://cloud.githubusercontent.com/assets/850797/16369960/53fb614e-3c6e-11e6-8fa3-7ffe65abcb49.png";> With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992) off-heap memory is supported for data persistence, so here change to also take off-heap storage memory into consideration. ## How was this patch tested? Unit test and local verification. Author: jerryshao Closes #13920 from jerryshao/SPARK-16166. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea7fe5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea7fe5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea7fe5 Branch: refs/heads/master Commit: f5ea7fe53974a7e8cbfc222b9a6f47669b53ccfd Parents: cda4603 Author: jerryshao Authored: Mon Jul 25 15:17:06 2016 -0700 Committer: Josh Rosen Committed: Mon Jul 25 15:17:06 2016 -0700 -- .../scala/org/apache/spark/memory/MemoryManager.scala | 10 -- .../org/apache/spark/memory/StaticMemoryManager.scala | 2 ++ .../org/apache/spark/memory/UnifiedMemoryManager.scala| 4 .../scala/org/apache/spark/storage/BlockManager.scala | 5 +++-- .../org/apache/spark/storage/memory/MemoryStore.scala | 4 +++- .../scala/org/apache/spark/memory/TestMemoryManager.scala | 2 ++ .../org/apache/spark/storage/BlockManagerSuite.scala | 8 7 files changed, 26 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea7fe5/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala -- diff --git a/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala b/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala index 0210217..82442cf 100644 --- a/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala +++ b/core/src/main/scala/org/apache/spark/memory/MemoryManager.scala @@ -62,13 +62,19 @@ private[spark] abstract class MemoryManager( offHeapStorageMemoryPool.incrementPoolSize(offHeapStorageMemory) /** - * Total available memory for storage, in bytes. This amount can vary over time, depending on - * the MemoryManager implementation. + * Total available on heap memory for storage, in bytes. This amount can vary over time, + * depending on the MemoryManager implementation. * In this model, this is equivalent to the amount of memory not occupied by execution. */ def maxOnHeapStorageMemory: Long /** + * Total available off heap memory for storage, in bytes. This amount can vary over time, + * depending on the MemoryManager implementation. + */ + def maxOffHeapStorageMemory: Long + + /** * Set the [[MemoryStore]] used by this manager to evict cached blocks. * This must be set after construction due to initialization ordering constraints. 
*/ http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea7fe5/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala -- diff --git a/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala b/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala index 08155aa..a6f7db0 100644 --- a/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala +++ b/core/src/main/scala/org/apache/spark/memory/StaticMemoryManager.scala @@ -55,6 +55,8 @@ private[spark] class StaticMemoryManager( (maxOnHeapStorageMemory * conf.getDouble("spark.storage.unrollFraction", 0.2)).toLong } + override def maxOffHeapStorageMemory: Long = 0L + override def acquireStorageMemory( blockId: BlockId, numBytes: Long, http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea7fe5/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala -- diff --git a/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala b/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala index c7b36be..fea2808 100644 --- a/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala +++ b/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala @@ -67,6 +67,10 @@ private[spark] class UnifiedMemoryManager private[memory] ( maxHeapMemory - onHeapExecutionMemoryPo
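A sketch of a configuration under which the newly reported off-heap storage numbers actually appear: off-heap memory must be enabled and blocks persisted with an off-heap storage level. The size and names below are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("off-heap-storage-example")
  .set("spark.memory.offHeap.enabled", "true") // off by default
  .set("spark.memory.offHeap.size", "1g")      // must be set when off-heap is enabled
val sc = new SparkContext(conf)

// Persist off-heap so both the MemoryStore log line and the Storage tab
// have some off-heap usage to report.
val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP)
cached.count()
```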
spark git commit: [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
Repository: spark Updated Branches: refs/heads/master f5ea7fe53 -> 12f490b5c [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash" ## What changes were proposed in this pull request? SubexpressionEliminationSuite."Semantic equals and hash" assumes the default AttributeReference's exprId wont' be "ExprId(1)". However, that depends on when this test runs. It may happen to use "ExprId(1)". This PR detects the conflict and makes sure we create a different ExprId when the conflict happens. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu Closes #14350 from zsxwing/SPARK-16715. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/12f490b5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/12f490b5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/12f490b5 Branch: refs/heads/master Commit: 12f490b5c85cdee26d47eb70ad1a1edd00504f21 Parents: f5ea7fe Author: Shixiong Zhu Authored: Mon Jul 25 16:08:29 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jul 25 16:08:29 2016 -0700 -- .../catalyst/expressions/SubexpressionEliminationSuite.scala | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/12f490b5/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala index 90e97d7..1e39b24 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala @@ -21,8 +21,12 @@ import org.apache.spark.sql.types.IntegerType class SubexpressionEliminationSuite extends SparkFunSuite { test("Semantic equals and hash") { -val id = ExprId(1) val a: AttributeReference = AttributeReference("name", IntegerType)() +val id = { + // Make sure we use a "ExprId" different from "a.exprId" + val _id = ExprId(1) + if (a.exprId == _id) ExprId(2) else _id +} val b1 = a.withName("name2").withExprId(id) val b2 = a.withExprId(id) val b3 = a.withQualifier(Some("qualifierName")) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
Repository: spark Updated Branches: refs/heads/branch-2.0 1b4f7cf13 -> 41e72f659 [SPARK-16715][TESTS] Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash" ## What changes were proposed in this pull request? SubexpressionEliminationSuite."Semantic equals and hash" assumes the default AttributeReference's exprId wont' be "ExprId(1)". However, that depends on when this test runs. It may happen to use "ExprId(1)". This PR detects the conflict and makes sure we create a different ExprId when the conflict happens. ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu Closes #14350 from zsxwing/SPARK-16715. (cherry picked from commit 12f490b5c85cdee26d47eb70ad1a1edd00504f21) Signed-off-by: Shixiong Zhu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41e72f65 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41e72f65 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41e72f65 Branch: refs/heads/branch-2.0 Commit: 41e72f65929c345aa21ebd4e00dadfbfb5acfdf3 Parents: 1b4f7cf Author: Shixiong Zhu Authored: Mon Jul 25 16:08:29 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jul 25 16:08:36 2016 -0700 -- .../catalyst/expressions/SubexpressionEliminationSuite.scala | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/41e72f65/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala index 90e97d7..1e39b24 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SubexpressionEliminationSuite.scala @@ -21,8 +21,12 @@ import org.apache.spark.sql.types.IntegerType class SubexpressionEliminationSuite extends SparkFunSuite { test("Semantic equals and hash") { -val id = ExprId(1) val a: AttributeReference = AttributeReference("name", IntegerType)() +val id = { + // Make sure we use a "ExprId" different from "a.exprId" + val _id = ExprId(1) + if (a.exprId == _id) ExprId(2) else _id +} val b1 = a.withName("name2").withExprId(id) val b2 = a.withExprId(id) val b3 = a.withQualifier(Some("qualifierName")) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog
Repository: spark Updated Branches: refs/heads/master 12f490b5c -> c979c8bba [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog ## What changes were proposed in this pull request? Current fix for deadlock disables interrupts in the StreamExecution which getting offsets for all sources, and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog(see JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix more narrow, by disabling interrupt it only in the HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupt. ## How was this patch tested? Existing tests. Author: Tathagata Das Closes #14292 from tdas/SPARK-14131. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c979c8bb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c979c8bb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c979c8bb Branch: refs/heads/master Commit: c979c8bba02bc89cb9ad81b212f085a8a5490a07 Parents: 12f490b Author: Tathagata Das Authored: Mon Jul 25 16:09:22 2016 -0700 Committer: Tathagata Das Committed: Mon Jul 25 16:09:22 2016 -0700 -- .../execution/streaming/HDFSMetadataLog.scala | 31 ++ .../execution/streaming/StreamExecution.scala | 28 - .../streaming/FileStreamSinkLogSuite.scala | 4 +- .../streaming/HDFSMetadataLogSuite.scala| 10 +++-- .../apache/spark/sql/test/SQLTestUtils.scala| 43 +++- 5 files changed, 80 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c979c8bb/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index 069e41b..698f07b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -32,6 +32,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.network.util.JavaUtils import org.apache.spark.serializer.JavaSerializer import org.apache.spark.sql.SparkSession +import org.apache.spark.util.UninterruptibleThread /** @@ -91,18 +92,30 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String) serializer.deserialize[T](ByteBuffer.wrap(bytes)) } + /** + * Store the metadata for the specified batchId and return `true` if successful. If the batchId's + * metadata has already been stored, this method will return `false`. + * + * Note that this method must be called on a [[org.apache.spark.util.UninterruptibleThread]] + * so that interrupts can be disabled while writing the batch file. This is because there is a + * potential dead-lock in Hadoop "Shell.runCommand" before 2.5.0 (HADOOP-10622). If the thread + * running "Shell.runCommand" is interrupted, then the thread can get deadlocked. In our + * case, `writeBatch` creates a file using HDFS API and calls "Shell.runCommand" to set the + * file permissions, and can get deadlocked if the stream execution thread is stopped by + * interrupt. Hence, we make sure that this method is called on [[UninterruptibleThread]] which + * allows us to disable interrupts here. Also see SPARK-14131. 
+ */ override def add(batchId: Long, metadata: T): Boolean = { get(batchId).map(_ => false).getOrElse { - // Only write metadata when the batch has not yet been written. - try { -writeBatch(batchId, serialize(metadata)) -true - } catch { -case e: IOException if "java.lang.InterruptedException" == e.getMessage => - // create may convert InterruptedException to IOException. Let's convert it back to - // InterruptedException so that this failure won't crash StreamExecution - throw new InterruptedException("Creating file is interrupted") + // Only write metadata when the batch has not yet been written + Thread.currentThread match { +case ut: UninterruptibleThread => + ut.runUninterruptibly { writeBatch(batchId, serialize(metadata)) } +case _ => + throw new IllegalStateException( +"HDFSMetadataLog.add() must be executed on a o.a.spark.util.UninterruptibleThread") } + true } } http://git-wip-us.apache.org/repos/asf/spark/blob/c979c8bb/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala --
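`UninterruptibleThread` is a Spark-internal (`private[spark]`) helper, so the snippet below is illustrative only and would compile only inside Spark's own packages; it simply shows the calling convention the new check enforces, with the metadata-log call left as a hypothetical comment:

```scala
import org.apache.spark.util.UninterruptibleThread

// The body passed to runUninterruptibly defers interrupts, so a metadata write that
// reaches Hadoop's Shell.runCommand cannot be interrupted into the deadlock described above.
val worker = new UninterruptibleThread("metadata-writer-example") {
  override def run(): Unit = {
    runUninterruptibly {
      // metadataLog.add(batchId, metadata) // hypothetical call; safe on this thread
    }
  }
}
worker.start()
worker.join()
```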
spark git commit: [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog
Repository: spark Updated Branches: refs/heads/branch-2.0 41e72f659 -> b17fe4e41 [SPARK-14131][STREAMING] SQL Improved fix for avoiding potential deadlocks in HDFSMetadataLog ## What changes were proposed in this pull request? Current fix for deadlock disables interrupts in the StreamExecution which getting offsets for all sources, and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog(see JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix more narrow, by disabling interrupt it only in the HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupt. ## How was this patch tested? Existing tests. Author: Tathagata Das Closes #14292 from tdas/SPARK-14131. (cherry picked from commit c979c8bba02bc89cb9ad81b212f085a8a5490a07) Signed-off-by: Tathagata Das Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b17fe4e4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b17fe4e4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b17fe4e4 Branch: refs/heads/branch-2.0 Commit: b17fe4e412d27a4f3e8ad86ac5d8c2c108654eb3 Parents: 41e72f6 Author: Tathagata Das Authored: Mon Jul 25 16:09:22 2016 -0700 Committer: Tathagata Das Committed: Mon Jul 25 16:09:35 2016 -0700 -- .../execution/streaming/HDFSMetadataLog.scala | 31 ++ .../execution/streaming/StreamExecution.scala | 28 - .../streaming/FileStreamSinkLogSuite.scala | 4 +- .../streaming/HDFSMetadataLogSuite.scala| 10 +++-- .../apache/spark/sql/test/SQLTestUtils.scala| 43 +++- 5 files changed, 80 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b17fe4e4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala index 069e41b..698f07b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala @@ -32,6 +32,7 @@ import org.apache.spark.internal.Logging import org.apache.spark.network.util.JavaUtils import org.apache.spark.serializer.JavaSerializer import org.apache.spark.sql.SparkSession +import org.apache.spark.util.UninterruptibleThread /** @@ -91,18 +92,30 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: SparkSession, path: String) serializer.deserialize[T](ByteBuffer.wrap(bytes)) } + /** + * Store the metadata for the specified batchId and return `true` if successful. If the batchId's + * metadata has already been stored, this method will return `false`. + * + * Note that this method must be called on a [[org.apache.spark.util.UninterruptibleThread]] + * so that interrupts can be disabled while writing the batch file. This is because there is a + * potential dead-lock in Hadoop "Shell.runCommand" before 2.5.0 (HADOOP-10622). If the thread + * running "Shell.runCommand" is interrupted, then the thread can get deadlocked. In our + * case, `writeBatch` creates a file using HDFS API and calls "Shell.runCommand" to set the + * file permissions, and can get deadlocked if the stream execution thread is stopped by + * interrupt. Hence, we make sure that this method is called on [[UninterruptibleThread]] which + * allows us to disable interrupts here. Also see SPARK-14131. 
+ */ override def add(batchId: Long, metadata: T): Boolean = { get(batchId).map(_ => false).getOrElse { - // Only write metadata when the batch has not yet been written. - try { -writeBatch(batchId, serialize(metadata)) -true - } catch { -case e: IOException if "java.lang.InterruptedException" == e.getMessage => - // create may convert InterruptedException to IOException. Let's convert it back to - // InterruptedException so that this failure won't crash StreamExecution - throw new InterruptedException("Creating file is interrupted") + // Only write metadata when the batch has not yet been written + Thread.currentThread match { +case ut: UninterruptibleThread => + ut.runUninterruptibly { writeBatch(batchId, serialize(metadata)) } +case _ => + throw new IllegalStateException( +"HDFSMetadataLog.add() must be executed on a o.a.spark.util.UninterruptibleThread") } + true } } http://git-wip-us.apache.org/repos/asf/spark/blob/b1
spark git commit: [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab
Repository: spark Updated Branches: refs/heads/master c979c8bba -> db36e1e75 [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab ## What changes were proposed in this pull request? This patch adds pagination support for the Job Tables in the Jobs tab. Pagination is provided for all of the three Job Tables (active, completed, and failed). Interactions (jumping, sorting, and setting page size) for paged tables are also included. The diff didn't keep track of some lines based on the original ones. The function `makeRow`of the original `AllJobsPage.scala` is reused. They are separated at the beginning of the function `jobRow` (L427-439) and the function `row`(L594-618) in the new `AllJobsPage.scala`. ## How was this patch tested? Tested manually by using checking the Web UI after completing and failing hundreds of jobs. Generate completed jobs by: ```scala val d = sc.parallelize(Array(1,2,3,4,5)) for(i <- 1 to 255){ var b = d.collect() } ``` Generate failed jobs by calling the following code multiple times: ```scala var b = d.map(_/0).collect() ``` Interactions like jumping, sorting, and setting page size are all tested. This shows the pagination for completed jobs: ![paginate success jobs](https://cloud.githubusercontent.com/assets/5558370/15986498/efa12ef6-303b-11e6-8b1d-c3382aeb9ad0.png) This shows the sorting works in job tables: ![sorting](https://cloud.githubusercontent.com/assets/5558370/15986539/98c8a81a-303c-11e6-86f2-8d2bc7924ee9.png) This shows the pagination for failed jobs and the effect of jumping and setting page size: ![paginate failed jobs](https://cloud.githubusercontent.com/assets/5558370/15986556/d8c1323e-303c-11e6-8e4b-7bdb030ea42b.png) Author: Tao Lin Closes #13620 from nblintao/dev. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db36e1e7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db36e1e7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db36e1e7 Branch: refs/heads/master Commit: db36e1e75d69d63b76312e85ae3a6c95cebbe65e Parents: c979c8b Author: Tao Lin Authored: Mon Jul 25 17:35:50 2016 -0700 Committer: Shixiong Zhu Committed: Mon Jul 25 17:35:50 2016 -0700 -- .../org/apache/spark/ui/jobs/AllJobsPage.scala | 369 --- .../org/apache/spark/ui/UISeleniumSuite.scala | 5 +- 2 files changed, 312 insertions(+), 62 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db36e1e7/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala -- diff --git a/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala b/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala index 035d706..e5363ce 100644 --- a/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala +++ b/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala @@ -17,17 +17,21 @@ package org.apache.spark.ui.jobs +import java.net.URLEncoder import java.util.Date import javax.servlet.http.HttpServletRequest +import scala.collection.JavaConverters._ import scala.collection.mutable.{HashMap, ListBuffer} import scala.xml._ import org.apache.commons.lang3.StringEscapeUtils import org.apache.spark.JobExecutionStatus -import org.apache.spark.ui.{ToolTips, UIUtils, WebUIPage} -import org.apache.spark.ui.jobs.UIData.{ExecutorUIData, JobUIData} +import org.apache.spark.scheduler.StageInfo +import org.apache.spark.ui._ +import org.apache.spark.ui.jobs.UIData.{ExecutorUIData, JobUIData, StageUIData} +import org.apache.spark.util.Utils /** Page showing list of all ongoing and recently finished jobs */ 
private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage("") { @@ -210,64 +214,69 @@ private[ui] class AllJobsPage(parent: JobsTab) extends WebUIPage("") { } - private def jobsTable(jobs: Seq[JobUIData]): Seq[Node] = { + private def jobsTable( + request: HttpServletRequest, + jobTag: String, + jobs: Seq[JobUIData]): Seq[Node] = { +val allParameters = request.getParameterMap.asScala.toMap +val parameterOtherTable = allParameters.filterNot(_._1.startsWith(jobTag)) + .map(para => para._1 + "=" + para._2(0)) + val someJobHasJobGroup = jobs.exists(_.jobGroup.isDefined) +val jobIdTitle = if (someJobHasJobGroup) "Job Id (Job Group)" else "Job Id" -val columns: Seq[Node] = { - {if (someJobHasJobGroup) "Job Id (Job Group)" else "Job Id"} - Description - Submitted - Duration - Stages: Succeeded/Total - Tasks (for all stages): Succeeded/Total -} +val parameterJobPage = request.getParameter(jobTag + ".page") +val parameterJobSortColumn = request.getParameter(jobTag + ".sort") +val parameterJobSortDesc = request.getParameter(jobTag + ".desc") +val pa
spark git commit: [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails
Repository: spark Updated Branches: refs/heads/branch-2.0 b17fe4e41 -> 9d581dc61 [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails ## What changes were proposed in this pull request? This PR moves `ssc.stop()` into `finally` for `StreamingContextSuite.createValidCheckpoint` to avoid leaking a StreamingContext since leaking a StreamingContext will fail a lot of tests and make us hard to find the real failure one. ## How was this patch tested? Jenkins unit tests Author: Shixiong Zhu Closes #14354 from zsxwing/ssc-leak. (cherry picked from commit e164a04b2ba3503e5c14cd1cd4beb40e0b79925a) Signed-off-by: Tathagata Das Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9d581dc6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9d581dc6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9d581dc6 Branch: refs/heads/branch-2.0 Commit: 9d581dc61951eccf0f06868e0d3f10134f433e82 Parents: b17fe4e Author: Shixiong Zhu Authored: Mon Jul 25 18:26:29 2016 -0700 Committer: Tathagata Das Committed: Mon Jul 25 18:26:37 2016 -0700 -- .../org/apache/spark/streaming/StreamingContextSuite.scala | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9d581dc6/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala -- diff --git a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala index 806e181..f1482e5 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala @@ -819,10 +819,13 @@ class StreamingContextSuite extends SparkFunSuite with BeforeAndAfter with Timeo ssc.checkpoint(checkpointDirectory) ssc.textFileStream(testDirectory).foreachRDD { rdd => rdd.count() } ssc.start() -eventually(timeout(1 millis)) { - assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1) +try { + eventually(timeout(3 millis)) { +assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1) + } +} finally { + ssc.stop() } -ssc.stop() checkpointDirectory } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
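The fix is an instance of a general clean-up pattern: pair every started StreamingContext with a `finally ssc.stop()` so that a failed assertion cannot leak the context into later tests. A generic sketch; the helper name is made up:

```scala
import org.apache.spark.streaming.StreamingContext

// Run `body` against a StreamingContext and stop the context no matter how `body` exits.
def withStreamingContext[T](ssc: StreamingContext)(body: StreamingContext => T): T =
  try body(ssc) finally ssc.stop()
```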
spark git commit: [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails
Repository: spark Updated Branches: refs/heads/master db36e1e75 -> e164a04b2 [SPARK-16722][TESTS] Fix a StreamingContext leak in StreamingContextSuite when eventually fails ## What changes were proposed in this pull request? This PR moves `ssc.stop()` into `finally` for `StreamingContextSuite.createValidCheckpoint` to avoid leaking a StreamingContext since leaking a StreamingContext will fail a lot of tests and make us hard to find the real failure one. ## How was this patch tested? Jenkins unit tests Author: Shixiong Zhu Closes #14354 from zsxwing/ssc-leak. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e164a04b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e164a04b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e164a04b Branch: refs/heads/master Commit: e164a04b2ba3503e5c14cd1cd4beb40e0b79925a Parents: db36e1e Author: Shixiong Zhu Authored: Mon Jul 25 18:26:29 2016 -0700 Committer: Tathagata Das Committed: Mon Jul 25 18:26:29 2016 -0700 -- .../org/apache/spark/streaming/StreamingContextSuite.scala | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e164a04b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala -- diff --git a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala index 806e181..f1482e5 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/StreamingContextSuite.scala @@ -819,10 +819,13 @@ class StreamingContextSuite extends SparkFunSuite with BeforeAndAfter with Timeo ssc.checkpoint(checkpointDirectory) ssc.textFileStream(testDirectory).foreachRDD { rdd => rdd.count() } ssc.start() -eventually(timeout(1 millis)) { - assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1) +try { + eventually(timeout(3 millis)) { +assert(Checkpoint.getCheckpointFiles(checkpointDirectory).size > 1) + } +} finally { + ssc.stop() } -ssc.stop() checkpointDirectory } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
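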
spark git commit: [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs
Repository: spark Updated Branches: refs/heads/master e164a04b2 -> 3fc456694 [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs ## What changes were proposed in this pull request? **Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)** When we create OR alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example, ``` hive> CREATE TABLE tab1 (id int); OK Time taken: 0.196 seconds hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1; FAILED: SemanticException [Error 10218]: Existing table is not a view The following is an existing table, not a view: default.tab1 hive> ALTER VIEW tab1 AS SELECT * FROM t1; FAILED: SemanticException [Error 10218]: Existing table is not a view The following is an existing table, not a view: default.tab1 hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1; OK Time taken: 0.678 seconds ``` **Issue 2: Strange Error when Issuing Load Table Against A View** Users should not be allowed to issue LOAD DATA against a view. Currently, when users doing it, we got a very strange runtime error. For example, ```SQL LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName ``` ``` java.lang.reflect.InvocationTargetException was thrown. java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680) ``` ## How was this patch tested? Added test cases Author: gatorsmile Closes #14314 from gatorsmile/tableDDLAgainstView. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3fc45669 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3fc45669 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3fc45669 Branch: refs/heads/master Commit: 3fc456694151e766c551b4bc58ed7c945666 Parents: e164a04 Author: gatorsmile Authored: Tue Jul 26 09:32:29 2016 +0800 Committer: Wenchen Fan Committed: Tue Jul 26 09:32:29 2016 +0800 -- .../sql/catalyst/catalog/SessionCatalog.scala | 6 +- .../spark/sql/execution/command/tables.scala| 33 - .../spark/sql/execution/command/views.scala | 5 ++ .../spark/sql/hive/execution/SQLViewSuite.scala | 71 +++- 4 files changed, 96 insertions(+), 19 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3fc45669/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala index 134fc4e..1856dc4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala @@ -443,12 +443,12 @@ class SessionCatalog( } /** - * Return whether a table with the specified name exists. + * Return whether a table/view with the specified name exists. 
* - * Note: If a database is explicitly specified, then this will return whether the table + * Note: If a database is explicitly specified, then this will return whether the table/view * exists in that particular database instead. In that case, even if there is a temporary * table with the same name, we will return false if the specified database does not - * contain the table. + * contain the table/view. */ def tableExists(name: TableIdentifier): Boolean = synchronized { val db = formatDatabaseName(name.database.getOrElse(currentDb)) http://git-wip-us.apache.org/repos/asf/spark/blob/3fc45669/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala index 8f3adad..c6daa95 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala @@ -202,35 +202,38 @@ case class LoadDataCommand( override def run(sparkSession: SparkSession): Seq[Row] = { val ca
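A sketch of the behaviour this patch enforces as seen from the Spark SQL side, assuming a Hive-enabled `SparkSession` named `spark`; the table and view names are illustrative, and the failing statements are left commented out:

```scala
spark.sql("CREATE TABLE tab1 (id INT)")

// With this patch, these fail analysis because tab1 is a table, not a view:
// spark.sql("CREATE OR REPLACE VIEW tab1 AS SELECT 1 AS id")
// spark.sql("ALTER VIEW tab1 AS SELECT 1 AS id")
// Likewise, LOAD DATA against a view is now rejected with a clear analysis error:
// spark.sql("LOAD DATA LOCAL INPATH '/tmp/data.txt' INTO TABLE some_view")

// Following Hive, IF NOT EXISTS against an existing same-name table is a silent no-op:
spark.sql("CREATE VIEW IF NOT EXISTS tab1 AS SELECT 1 AS id")
```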
spark git commit: Fix description of spark.speculation.quantile
Repository: spark Updated Branches: refs/heads/branch-2.0 9d581dc61 -> 3d3547487 Fix description of spark.speculation.quantile ## What changes were proposed in this pull request? Minor doc fix regarding the spark.speculation.quantile configuration parameter. It incorrectly states it should be a percentage, when it should be a fraction. ## How was this patch tested? I tried building the documentation but got some unidoc errors. I also got them when building off origin/master, so I don't think I caused that problem. I did run the web app and saw the changes reflected as expected. Author: Nicholas Brown Closes #14352 from nwbvt/master. (cherry picked from commit ba0aade6d517364363e07ed09278c2b44110c33b) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3d354748 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3d354748 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3d354748 Branch: refs/heads/branch-2.0 Commit: 3d35474872d3b117abc3fc7debcb1eb6409769d6 Parents: 9d581dc Author: Nicholas Brown Authored: Mon Jul 25 19:18:27 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 19:18:33 2016 -0700 -- docs/configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3d354748/docs/configuration.md -- diff --git a/docs/configuration.md b/docs/configuration.md index 86a9bd9..bf10b24 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1174,7 +1174,7 @@ Apart from these, the following properties are also available, and may be useful spark.speculation.quantile 0.75 -Percentage of tasks which must be complete before speculation is enabled for a particular stage. +Fraction of tasks which must be complete before speculation is enabled for a particular stage. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
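For reference, a sketch of how the corrected setting reads in practice; the values shown are the documented defaults:

```scala
import org.apache.spark.SparkConf

// spark.speculation.quantile is a fraction in [0, 1]: with 0.75, three quarters of a
// stage's tasks must have finished before speculative copies of slow tasks are launched.
val conf = new SparkConf()
  .set("spark.speculation", "true")          // speculation itself is off by default
  .set("spark.speculation.quantile", "0.75") // a fraction of tasks, not a percentage
```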
spark git commit: Fix description of spark.speculation.quantile
Repository: spark Updated Branches: refs/heads/master 3fc456694 -> ba0aade6d Fix description of spark.speculation.quantile ## What changes were proposed in this pull request? Minor doc fix regarding the spark.speculation.quantile configuration parameter. It incorrectly states it should be a percentage, when it should be a fraction. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) I tried building the documentation but got some unidoc errors. I also got them when building off origin/master, so I don't think I caused that problem. I did run the web app and saw the changes reflected as expected. Author: Nicholas Brown Closes #14352 from nwbvt/master. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba0aade6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba0aade6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba0aade6 Branch: refs/heads/master Commit: ba0aade6d517364363e07ed09278c2b44110c33b Parents: 3fc4566 Author: Nicholas Brown Authored: Mon Jul 25 19:18:27 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 19:18:27 2016 -0700 -- docs/configuration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ba0aade6/docs/configuration.md -- diff --git a/docs/configuration.md b/docs/configuration.md index 86a9bd9..bf10b24 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1174,7 +1174,7 @@ Apart from these, the following properties are also available, and may be useful spark.speculation.quantile 0.75 -Percentage of tasks which must be complete before speculation is enabled for a particular stage. +Fraction of tasks which must be complete before speculation is enabled for a particular stage. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
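A minimal, hedged example of the corrected wording in practice (the application name is made up): the property takes a fraction in [0, 1], not a percentage.

```scala
import org.apache.spark.sql.SparkSession

// With 0.9, speculative execution starts once 90% of a stage's tasks have completed.
// The documented default is 0.75.
val spark = SparkSession.builder()
  .appName("speculation-demo")
  .config("spark.speculation", "true")
  .config("spark.speculation.quantile", "0.9")
  .getOrCreate()
```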
spark git commit: [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries
Repository: spark Updated Branches: refs/heads/master ba0aade6d -> 8a8d26f1e [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries ## What changes were proposed in this pull request? Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this. ```scala scala> sql("CREATE TABLE t1(a int)") scala> val df = sql("select * from t1 b where exists (select * from t1 a)") scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL java.lang.UnsupportedOperationException: empty.reduceLeft ``` ## How was this patch tested? Pass the Jenkins tests with a new test suite. Author: Dongjoon Hyun Closes #14307 from dongjoon-hyun/SPARK-16672. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8a8d26f1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8a8d26f1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8a8d26f1 Branch: refs/heads/master Commit: 8a8d26f1e27db5c2228307b1c3609b4713b9d0db Parents: ba0aade Author: Dongjoon Hyun Authored: Mon Jul 25 19:52:17 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 19:52:17 2016 -0700 -- .../scala/org/apache/spark/sql/catalyst/SQLBuilder.scala | 9 +++-- sql/hive/src/test/resources/sqlgen/predicate_subquery.sql | 4 .../apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala | 10 ++ 3 files changed, 21 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8a8d26f1/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala index a8cc72f..9a02e3c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala @@ -512,8 +512,13 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { ScalarSubquery(rewrite, Seq.empty, exprId) case PredicateSubquery(query, conditions, false, exprId) => - val plan = Project(Seq(Alias(Literal(1), "1")()), -Filter(conditions.reduce(And), addSubqueryIfNeeded(query))) + val subquery = addSubqueryIfNeeded(query) + val plan = if (conditions.isEmpty) { +subquery + } else { +Project(Seq(Alias(Literal(1), "1")()), + Filter(conditions.reduce(And), subquery)) + } Exists(plan, exprId) case PredicateSubquery(query, conditions, true, exprId) => http://git-wip-us.apache.org/repos/asf/spark/blob/8a8d26f1/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql -- diff --git a/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql new file mode 100644 index 000..2e06b4f --- /dev/null +++ b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql @@ -0,0 +1,4 @@ +-- This file is automatically generated by LogicalPlanToSQLSuite. 
+select * from t1 b where exists (select * from t1 a) + +SELECT `gen_attr` AS `a` FROM (SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM `default`.`t1`) AS gen_subquery_0 WHERE EXISTS(SELECT `gen_attr` AS `a` FROM ((SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM `default`.`t1`) AS gen_subquery_0) AS gen_subquery_1) AS gen_subquery_1)) AS b http://git-wip-us.apache.org/repos/asf/spark/blob/8a8d26f1/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala index 1f5078d..ebece38 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala @@ -25,6 +25,7 @@ import scala.util.control.NonFatal import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.parser.ParseException import org.apache.spark.sql.functions._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SQLTestUtils /** @@ -927,6 +928,15 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { } } + test("predicate subquery") { +withTable("t1") { + withSQLConf(SQLConf.CROSS_JOINS_ENABLED.key -> "true") { +sql("CREATE TABLE t1(a int)
spark git commit: [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries
Repository: spark Updated Branches: refs/heads/branch-2.0 3d3547487 -> aeb6d5c05 [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries ## What changes were proposed in this pull request? Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this. ```scala scala> sql("CREATE TABLE t1(a int)") scala> val df = sql("select * from t1 b where exists (select * from t1 a)") scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL java.lang.UnsupportedOperationException: empty.reduceLeft ``` ## How was this patch tested? Pass the Jenkins tests with a new test suite. Author: Dongjoon Hyun Closes #14307 from dongjoon-hyun/SPARK-16672. (cherry picked from commit 8a8d26f1e27db5c2228307b1c3609b4713b9d0db) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/aeb6d5c0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/aeb6d5c0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/aeb6d5c0 Branch: refs/heads/branch-2.0 Commit: aeb6d5c053d4e848df0e7842a3994154df464647 Parents: 3d35474 Author: Dongjoon Hyun Authored: Mon Jul 25 19:52:17 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 19:52:22 2016 -0700 -- .../scala/org/apache/spark/sql/catalyst/SQLBuilder.scala | 9 +++-- sql/hive/src/test/resources/sqlgen/predicate_subquery.sql | 4 .../apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala | 10 ++ 3 files changed, 21 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/aeb6d5c0/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala index a8cc72f..9a02e3c 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/SQLBuilder.scala @@ -512,8 +512,13 @@ class SQLBuilder(logicalPlan: LogicalPlan) extends Logging { ScalarSubquery(rewrite, Seq.empty, exprId) case PredicateSubquery(query, conditions, false, exprId) => - val plan = Project(Seq(Alias(Literal(1), "1")()), -Filter(conditions.reduce(And), addSubqueryIfNeeded(query))) + val subquery = addSubqueryIfNeeded(query) + val plan = if (conditions.isEmpty) { +subquery + } else { +Project(Seq(Alias(Literal(1), "1")()), + Filter(conditions.reduce(And), subquery)) + } Exists(plan, exprId) case PredicateSubquery(query, conditions, true, exprId) => http://git-wip-us.apache.org/repos/asf/spark/blob/aeb6d5c0/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql -- diff --git a/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql new file mode 100644 index 000..2e06b4f --- /dev/null +++ b/sql/hive/src/test/resources/sqlgen/predicate_subquery.sql @@ -0,0 +1,4 @@ +-- This file is automatically generated by LogicalPlanToSQLSuite. 
+select * from t1 b where exists (select * from t1 a) + +SELECT `gen_attr` AS `a` FROM (SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM `default`.`t1`) AS gen_subquery_0 WHERE EXISTS(SELECT `gen_attr` AS `a` FROM ((SELECT `gen_attr` FROM (SELECT `a` AS `gen_attr` FROM `default`.`t1`) AS gen_subquery_0) AS gen_subquery_1) AS gen_subquery_1)) AS b http://git-wip-us.apache.org/repos/asf/spark/blob/aeb6d5c0/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala -- diff --git a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala index 1f5078d..ebece38 100644 --- a/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala +++ b/sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala @@ -25,6 +25,7 @@ import scala.util.control.NonFatal import org.apache.spark.sql.Column import org.apache.spark.sql.catalyst.parser.ParseException import org.apache.spark.sql.functions._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SQLTestUtils /** @@ -927,6 +928,15 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { } } + test("predicate subquery") { +withTable("t
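The root cause is ordinary Scala behaviour, shown here in isolation (plain Scala, no Catalyst types): reducing an empty sequence throws, which is exactly what an unoptimized EXISTS subquery with no extra conditions triggered, so the fix only builds the Filter/Project wrapper when there are conditions to combine.

```scala
// When the EXISTS subquery carries no correlation conditions, `conditions` is empty.
val conditions: Seq[String] = Seq.empty

// conditions.reduce(_ + " AND " + _)
// => java.lang.UnsupportedOperationException: empty.reduceLeft

// Guarding on isEmpty, as the patch does, avoids the reduce entirely:
val predicate = if (conditions.isEmpty) None else Some(conditions.reduce(_ + " AND " + _))
println(predicate) // None
```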
spark git commit: [SPARK-16724] Expose DefinedByConstructorParams
Repository: spark Updated Branches: refs/heads/branch-2.0 aeb6d5c05 -> 4b38a6a53 [SPARK-16724] Expose DefinedByConstructorParams We don't generally make things in catalyst/execution private. Instead they are just undocumented due to their lack of stability guarantees. Author: Michael Armbrust Closes #14356 from marmbrus/patch-1. (cherry picked from commit f99e34e8e58c97ff30c6e054875533350d99fe5b) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4b38a6a5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4b38a6a5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4b38a6a5 Branch: refs/heads/branch-2.0 Commit: 4b38a6a534d93b1eab3b19f62a2f78474be1d8bc Parents: aeb6d5c Author: Michael Armbrust Authored: Mon Jul 25 20:41:24 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 20:41:29 2016 -0700 -- .../main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4b38a6a5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index 78c145d..8affb03 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -30,7 +30,7 @@ import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} * for classes whose fields are entirely defined by constructor params but should not be * case classes. */ -private[sql] trait DefinedByConstructorParams +trait DefinedByConstructorParams /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-16724] Expose DefinedByConstructorParams
Repository: spark Updated Branches: refs/heads/master 8a8d26f1e -> f99e34e8e [SPARK-16724] Expose DefinedByConstructorParams We don't generally make things in catalyst/execution private. Instead they are just undocumented due to their lack of stability guarantees. Author: Michael Armbrust Closes #14356 from marmbrus/patch-1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f99e34e8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f99e34e8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f99e34e8 Branch: refs/heads/master Commit: f99e34e8e58c97ff30c6e054875533350d99fe5b Parents: 8a8d26f Author: Michael Armbrust Authored: Mon Jul 25 20:41:24 2016 -0700 Committer: Reynold Xin Committed: Mon Jul 25 20:41:24 2016 -0700 -- .../main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f99e34e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala index 78c145d..8affb03 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala @@ -30,7 +30,7 @@ import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} * for classes whose fields are entirely defined by constructor params but should not be * case classes. */ -private[sql] trait DefinedByConstructorParams +trait DefinedByConstructorParams /** - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
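As a hedged sketch of what exposing the trait enables (the class, the encoder derivation, and the `spark` session name are illustrative, not from the patch): a plain class whose state is fully described by its constructor parameters can mix in the trait so Catalyst's reflection treats it like a case class when deriving encoders.

```scala
import org.apache.spark.sql.catalyst.DefinedByConstructorParams
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Not a case class, but every field is defined by a constructor parameter.
class Point(val x: Double, val y: Double) extends DefinedByConstructorParams

// Assuming an existing SparkSession named `spark`:
implicit val pointEncoder: ExpressionEncoder[Point] = ExpressionEncoder[Point]()
val points = spark.createDataset(Seq(new Point(1.0, 2.0), new Point(3.0, 4.0)))
points.show()
```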
spark git commit: [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions
Repository: spark Updated Branches: refs/heads/master f99e34e8e -> 815f3eece [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions ## What changes were proposed in this pull request? This PR contains three changes. First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below: 1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value. 2. If the offset row does not exist, the default value will be used. 3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change). Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist. Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved. ## How was this patch tested? New tests in SQLWindowFunctionSuite Author: Yin Huai Closes #14284 from yhuai/lead-lag. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/815f3eec Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/815f3eec Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/815f3eec Branch: refs/heads/master Commit: 815f3eece5f095919a329af8cbd762b9ed71c7a8 Parents: f99e34e Author: Yin Huai Authored: Mon Jul 25 20:58:07 2016 -0700 Committer: Yin Huai Committed: Mon Jul 25 20:58:07 2016 -0700 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 3 +- .../expressions/windowExpressions.scala | 45 +- .../apache/spark/sql/execution/WindowExec.scala | 34 +- .../sql/execution/SQLWindowFunctionSuite.scala | 414 +++ .../hive/execution/SQLWindowFunctionSuite.scala | 370 - 5 files changed, 467 insertions(+), 399 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/815f3eec/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index d1d2c59..61162cc 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1787,7 +1787,8 @@ class Analyzer( s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) if wf.frame != UnspecifiedFrame => WindowExpression(wf, s.copy(frameSpecification = wf.frame)) -case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) => +case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) + if e.resolved => val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true) we.copy(windowSpec = s.copy(frameSpecification = frame)) } http://git-wip-us.apache.org/repos/asf/spark/blob/815f3eec/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index e35192c..6806591 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -321,8 +321,7 @@ abstract class OffsetWindowFunction val input: Expression /** - * Default result value for the function when the input expression returns NULL. The default will - * evaluated against the current row instead of the offset row. + * Default result value for the function when the 'offset'th row does not exist. */ val default: Expression @@ -348,7 +347,7 @@ abstract class OffsetWindowFunction */ override def foldable: Boolean = false - override def nullable: Boolean = default == null || default.nullable + override def nullable: Boolean = default == null || default.nullable || input.nullable override lazy val frame = { // This will be triggered by the Analyzer. @@ -373,20 +372,22 @@ abstract class OffsetWindowFunction } /** - * The Lead function returns the value of 'x' at 'offset' rows after the current row in the windo
spark git commit: [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions
Repository: spark Updated Branches: refs/heads/branch-2.0 4b38a6a53 -> 4391d4a3c [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions ## What changes were proposed in this pull request? This PR contains three changes. First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below: 1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value. 2. If the offset row does not exist, the default value will be used. 3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change). Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist. Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved. ## How was this patch tested? New tests in SQLWindowFunctionSuite Author: Yin Huai Closes #14284 from yhuai/lead-lag. (cherry picked from commit 815f3eece5f095919a329af8cbd762b9ed71c7a8) Signed-off-by: Yin Huai Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4391d4a3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4391d4a3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4391d4a3 Branch: refs/heads/branch-2.0 Commit: 4391d4a3c60d59df625cbfdb918aa67c51ebcbc1 Parents: 4b38a6a Author: Yin Huai Authored: Mon Jul 25 20:58:07 2016 -0700 Committer: Yin Huai Committed: Mon Jul 25 20:58:57 2016 -0700 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 3 +- .../expressions/windowExpressions.scala | 45 +- .../apache/spark/sql/execution/WindowExec.scala | 34 +- .../sql/execution/SQLWindowFunctionSuite.scala | 414 +++ .../hive/execution/SQLWindowFunctionSuite.scala | 370 - 5 files changed, 467 insertions(+), 399 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4391d4a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index d1d2c59..61162cc 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1787,7 +1787,8 @@ class Analyzer( s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) if wf.frame != UnspecifiedFrame => WindowExpression(wf, s.copy(frameSpecification = wf.frame)) -case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) => +case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) + if e.resolved => val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true) we.copy(windowSpec = s.copy(frameSpecification = frame)) } http://git-wip-us.apache.org/repos/asf/spark/blob/4391d4a3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala index e35192c..6806591 100644 --- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala @@ -321,8 +321,7 @@ abstract class OffsetWindowFunction val input: Expression /** - * Default result value for the function when the input expression returns NULL. The default will - * evaluated against the current row instead of the offset row. + * Default result value for the function when the 'offset'th row does not exist. */ val default: Expression @@ -348,7 +347,7 @@ abstract class OffsetWindowFunction */ override def foldable: Boolean = false - override def nullable: Boolean = default == null || default.nullable + override def nullable: Boolean = default == null || default.nullable || input.nullable override lazy val frame = { // This will be triggered by the Analyzer. @@ -373,20 +372,22 @@ abstract class OffsetWindowFunction }
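A hedged SQL-level illustration of the restored Spark 1.6 semantics (table and column names are made up, and a SparkSession named `spark` is assumed): a NULL value in an existing offset row stays NULL, while a missing offset row yields the default, even when the input expression is a literal.

```scala
// Assuming a table events(id INT, value INT):
spark.sql("""
  SELECT
    id,
    value,
    LAG(value, 1, -1)  OVER (ORDER BY id) AS prev_value_or_default,
    LEAD(value, 1, -1) OVER (ORDER BY id) AS next_value_or_default
  FROM events
""").show()
// For the first row, LAG's offset row does not exist, so the default -1 is returned.
// If the offset row exists but its value is NULL, NULL is returned, not -1.
```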
spark git commit: [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning
Repository: spark Updated Branches: refs/heads/master 815f3eece -> 7b06a8948 [SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning ## What changes were proposed in this pull request? We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will encounter whole data instead of sampled data. It will bring some inconsistency between original plan (Sample then Project) and optimized plan (Project then Sample). In the extreme case such as attached in the JIRA, if the projected column is an UDF which is supposed to not see the sampled out data, the result of UDF will be incorrect. Since the rule `ColumnPruning` already handles general `Project` pushdown. We don't need `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue. ## How was this patch tested? Jenkins tests. Author: Liang-Chi Hsieh Closes #14327 from viirya/fix-sample-pushdown. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7b06a894 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7b06a894 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7b06a894 Branch: refs/heads/master Commit: 7b06a8948fc16d3c14e240fdd632b79ce1651008 Parents: 815f3ee Author: Liang-Chi Hsieh Authored: Tue Jul 26 12:00:01 2016 +0800 Committer: Wenchen Fan Committed: Tue Jul 26 12:00:01 2016 +0800 -- .../sql/catalyst/optimizer/Optimizer.scala | 12 -- .../catalyst/optimizer/ColumnPruningSuite.scala | 15 .../optimizer/FilterPushdownSuite.scala | 17 - .../org/apache/spark/sql/DatasetSuite.scala | 25 4 files changed, 40 insertions(+), 29 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7b06a894/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index c8e9d8e..fe328fd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -76,7 +76,6 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: CatalystConf) Batch("Operator Optimizations", fixedPoint, // Operator push down PushThroughSetOperations, - PushProjectThroughSample, ReorderJoin, EliminateOuterJoin, PushPredicateThroughJoin, @@ -149,17 +148,6 @@ class SimpleTestOptimizer extends Optimizer( new SimpleCatalystConf(caseSensitiveAnalysis = true)) /** - * Pushes projects down beneath Sample to enable column pruning with sampling. - */ -object PushProjectThroughSample extends Rule[LogicalPlan] { - def apply(plan: LogicalPlan): LogicalPlan = plan transform { -// Push down projection into sample -case Project(projectList, Sample(lb, up, replace, seed, child)) => - Sample(lb, up, replace, seed, Project(projectList, child))() - } -} - -/** * Removes the Project only conducting Alias of its child node. * It is created mainly for removing extra Project added in EliminateSerialization rule, * but can also benefit other operators. 
http://git-wip-us.apache.org/repos/asf/spark/blob/7b06a894/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala index b5664a5..589607e 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala @@ -346,5 +346,20 @@ class ColumnPruningSuite extends PlanTest { comparePlans(Optimize.execute(plan1.analyze), correctAnswer1) } + test("push project down into sample") { +val testRelation = LocalRelation('a.int, 'b.int, 'c.int) +val x = testRelation.subquery('x) + +val query1 = Sample(0.0, 0.6, false, 11L, x)().select('a) +val optimized1 = Optimize.execute(query1.analyze) +val expected1 = Sample(0.0, 0.6, false, 11L, x.select('a))() +comparePlans(optimized1, expected1.analyze) + +val query2 = Sample(0.0, 0.6, false, 11L, x)().select('a as 'aa) +val optimized2 = Optimize.execute(query2.analyze) +val expected2 = Sample(0.0, 0.6, f
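To make the correctness concern concrete, here is a hedged repro sketch (names are illustrative, and a SparkSession named `spark` is assumed): with the removed rule, a Project containing a UDF could be rewritten from `Sample -> Project(udf)` into `Project(udf) -> Sample`, so the UDF would be evaluated over the full input rather than only the sampled rows; relying on ColumnPruning keeps Sample beneath the Project.

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// A UDF that, per the intent of the query, must only see the sampled rows.
val expensive = udf((id: Long) => id * 2)

val df = spark.range(0, 1000)
  .sample(withReplacement = false, fraction = 0.1, seed = 11L)
  .select(expensive($"id").as("doubled"))

// After this change the optimized plan keeps Sample below the Project,
// so `expensive` runs on roughly 100 rows instead of all 1000.
df.explain(true)
```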
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0 [created] 13650fc58
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0-rc5 [deleted] 13650fc58
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0-rc4 [deleted] e5f8c1117
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0-rc2 [deleted] 4a55b2326
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0-rc3 [deleted] 48d1fa3e7
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.0.0-rc1 [deleted] 0c66ca41a