svn commit: r28683 - in /dev/spark/2.4.0-SNAPSHOT-2018_08_13_00_02-a992827-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Mon Aug 13 07:16:40 2018 New Revision: 28683 Log: Apache Spark 2.4.0-SNAPSHOT-2018_08_13_00_02-a992827 docs [This commit notification would consist of 1476 parts, which exceeds the 50-part limit, so it was shortened to this summary.]
spark git commit: [SPARK-25096][SQL] Loosen nullability if the cast is force-nullable.
Repository: spark Updated Branches: refs/heads/master a9928277d -> b270bccff

[SPARK-25096][SQL] Loosen nullability if the cast is force-nullable.

## What changes were proposed in this pull request?

In type coercion for complex types, if casting to the found common type would force nullability, we should loosen the nullability of the result type so that the cast remains valid. For map key types, which must not contain nulls, such a force-nullable type cannot be used at all.

## How was this patch tested?

Added some tests.

Closes #22086 from ueshin/issues/SPARK-25096/fix_type_coercion.

Authored-by: Takuya UESHIN Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b270bccf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b270bccf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b270bccf Branch: refs/heads/master Commit: b270bccb21b814e77ae55c1b74bc25d7 Parents: a992827 Author: Takuya UESHIN Authored: Mon Aug 13 19:27:17 2018 +0800 Committer: hyukjinkwon Committed: Mon Aug 13 19:27:17 2018 +0800 -- .../sql/catalyst/analysis/TypeCoercion.scala| 21 +--- .../catalyst/analysis/TypeCoercionSuite.scala | 16 +++ 2 files changed, 30 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b270bccf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala index 27839d7..10d9ee5 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala @@ -153,19 +153,26 @@ object TypeCoercion { t2: DataType, findTypeFunc: (DataType, DataType) => Option[DataType]): Option[DataType] = (t1, t2) match { case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) => - findTypeFunc(et1, et2).map(ArrayType(_, containsNull1 || containsNull2)) + findTypeFunc(et1, et2).map { et => +ArrayType(et, containsNull1 || containsNull2 || + Cast.forceNullable(et1, et) || Cast.forceNullable(et2, et)) + } case (MapType(kt1, vt1, valueContainsNull1), MapType(kt2, vt2, valueContainsNull2)) => - findTypeFunc(kt1, kt2).flatMap { kt => -findTypeFunc(vt1, vt2).map { vt => - MapType(kt, vt, valueContainsNull1 || valueContainsNull2) -} + findTypeFunc(kt1, kt2) +.filter { kt => !Cast.forceNullable(kt1, kt) && !Cast.forceNullable(kt2, kt) } +.flatMap { kt => + findTypeFunc(vt1, vt2).map { vt => +MapType(kt, vt, valueContainsNull1 || valueContainsNull2 || + Cast.forceNullable(vt1, vt) || Cast.forceNullable(vt2, vt)) + } } case (StructType(fields1), StructType(fields2)) if fields1.length == fields2.length => val resolver = SQLConf.get.resolver fields1.zip(fields2).foldLeft(Option(new StructType())) { case (Some(struct), (field1, field2)) if resolver(field1.name, field2.name) => - findTypeFunc(field1.dataType, field2.dataType).map { -dt => struct.add(field1.name, dt, field1.nullable || field2.nullable) + findTypeFunc(field1.dataType, field2.dataType).map { dt => +struct.add(field1.name, dt, field1.nullable || field2.nullable || + Cast.forceNullable(field1.dataType, dt) || Cast.forceNullable(field2.dataType, dt)) } case _ => None } http://git-wip-us.apache.org/repos/asf/spark/blob/b270bccf/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala -- diff --git
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala index d71bbb3..2c6cb3a 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala @@ -499,6 +499,10 @@ class TypeCoercionSuite extends AnalysisTest { ArrayType(new StructType().add("num", ShortType), containsNull = false), ArrayType(new StructType().add("num", LongType), containsNull = false), Some(ArrayType(new StructType().add("num", LongType), containsNull = false))) +widenTestWithStringPromotion( + ArrayType(IntegerType, containsNull = false), + ArrayType(DecimalType.IntDecimal, containsNull = false), + Some(ArrayType(Dec
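The following is a minimal, self-contained sketch of the rule this patch implements, using toy stand-ins for Spark's `org.apache.spark.sql.types` classes; the `forceNullable` predicate below is a simplified assumption standing in for `Cast.forceNullable`, not Spark's actual logic. It shows why the widened array type must allow nulls whenever casting either element type to the common type can itself produce nulls:

```scala
// Toy model of the widening rule; every type here is a simplified stand-in.
sealed trait DataType
case object IntType extends DataType
final case class DecimalType(precision: Int, scale: Int) extends DataType
final case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

object WidenSketch {
  // Assumption: a cast "forces" nullability when it can turn a non-null value
  // into null, e.g. a decimal cast that may overflow. Spark's real predicate
  // (Cast.forceNullable) covers many more cases.
  def forceNullable(from: DataType, to: DataType): Boolean = (from, to) match {
    case (DecimalType(p1, s1), DecimalType(p2, s2)) => p2 - s2 < p1 - s1 || s2 < s1
    case (IntType, DecimalType(p, s)) => p - s < 10 // an Int may need 10 integral digits
    case _ => false
  }

  def widenArrays(
      t1: ArrayType,
      t2: ArrayType,
      findWiderType: (DataType, DataType) => Option[DataType]): Option[ArrayType] =
    findWiderType(t1.elementType, t2.elementType).map { et =>
      // Loosen nullability: either side already allowed nulls, or either
      // element cast to the widened type is force-nullable.
      ArrayType(et, t1.containsNull || t2.containsNull ||
        forceNullable(t1.elementType, et) || forceNullable(t2.elementType, et))
    }
}
```

The map-key branch in the real patch adds a `.filter` instead: since map keys can never be null, a force-nullable key cast disqualifies the candidate type entirely.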
spark git commit: [SPARK-24391][SQL] Support arrays of any types by from_json
Repository: spark Updated Branches: refs/heads/master b270bccff -> ab06c2535

[SPARK-24391][SQL] Support arrays of any types by from_json

## What changes were proposed in this pull request?

The PR removes the restriction on element types of the root-level array type that exists in `from_json`. Currently, the function can handle only arrays of structs; even arrays of primitive types are disallowed. The PR allows arrays of any type currently supported by the JSON datasource. Here is an example of an array of a primitive type:

```
scala> import org.apache.spark.sql.functions._
scala> val df = Seq("[1, 2, 3]").toDF("a")
scala> val schema = new ArrayType(IntegerType, false)
scala> val arr = df.select(from_json($"a", schema))
scala> arr.printSchema
root
 |-- jsontostructs(a): array (nullable = true)
 |    |-- element: integer (containsNull = true)
```

and the result of converting the JSON string to the `ArrayType`:

```
scala> arr.show
+----------------+
|jsontostructs(a)|
+----------------+
|       [1, 2, 3]|
+----------------+
```

## How was this patch tested?

I added a few positive and negative tests:
- array of primitive types
- array of arrays
- array of structs
- array of maps

Closes #21439 from MaxGekk/from_json-array.

Lead-authored-by: Maxim Gekk Co-authored-by: Maxim Gekk Signed-off-by: hyukjinkwon

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab06c253 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab06c253 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab06c253 Branch: refs/heads/master Commit: ab06c25350f8a997bef0c3dd8aa82b709e7dfb3f Parents: b270bcc Author: Maxim Gekk Authored: Mon Aug 13 20:13:09 2018 +0800 Committer: hyukjinkwon Committed: Mon Aug 13 20:13:09 2018 +0800 -- python/pyspark/sql/functions.py | 7 +- .../catalyst/expressions/jsonExpressions.scala | 19 ++--- .../spark/sql/catalyst/json/JacksonParser.scala | 30 .../scala/org/apache/spark/sql/functions.scala | 10 +-- .../sql-tests/inputs/json-functions.sql | 12 .../sql-tests/results/json-functions.sql.out| 66 - .../apache/spark/sql/JsonFunctionsSuite.scala | 76 ++-- 7 files changed, 194 insertions(+), 26 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ab06c253/python/pyspark/sql/functions.py -- diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py index eaecf28..f583373 100644 --- a/python/pyspark/sql/functions.py +++ b/python/pyspark/sql/functions.py @@ -2241,7 +2241,7 @@ def json_tuple(col, *fields): def from_json(col, schema, options={}): """ Parses a column containing a JSON string into a :class:`MapType` with :class:`StringType` -as keys type, :class:`StructType` or :class:`ArrayType` of :class:`StructType`\\s with +as keys type, :class:`StructType` or :class:`ArrayType` with the specified schema. Returns `null`, in the case of an unparseable string.
:param col: string column in json format @@ -2269,6 +2269,11 @@ def from_json(col, schema, options={}): >>> schema = schema_of_json(lit('''{"a": 0}''')) >>> df.select(from_json(df.value, schema).alias("json")).collect() [Row(json=Row(a=1))] +>>> data = [(1, '''[1, 2, 3]''')] +>>> schema = ArrayType(IntegerType()) +>>> df = spark.createDataFrame(data, ("key", "value")) +>>> df.select(from_json(df.value, schema).alias("json")).collect() +[Row(json=[1, 2, 3])] """ sc = SparkContext._active_spark_context http://git-wip-us.apache.org/repos/asf/spark/blob/ab06c253/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala index abe8875..ca99100 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala @@ -495,7 +495,7 @@ case class JsonTuple(children: Seq[Expression]) } /** - * Converts an json input string to a [[StructType]] or [[ArrayType]] of [[StructType]]s + * Converts an json input string to a [[StructType]], [[ArrayType]] or [[MapType]] * with the specified schema. */ // scalastyle:off line.size.limit @@ -544,17 +544,10 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = nullableSchema match { -case _: StructType | ArrayType(_: StructType, _) | _: Map
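In the same spirit as the examples in the PR description above, here is a hedged spark-shell sketch (not taken from the PR itself; the expected result is an assumption based on its description) of another newly allowed root type, an array of maps:

```scala
// Assumes a spark-shell session, where `spark` and its implicits are in scope.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StringType}

val df = Seq("""[{"a": 1}, {"b": 2}]""").toDF("json")
val schema = ArrayType(MapType(StringType, IntegerType))
// Expected output: a single row holding Array(Map("a" -> 1), Map("b" -> 2)).
df.select(from_json($"json", schema)).show(false)
```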
spark git commit: [SPARK-25099][SQL][TEST] Generate Avro Binary files in test suite
Repository: spark Updated Branches: refs/heads/master ab06c2535 -> 26775e3c8

[SPARK-25099][SQL][TEST] Generate Avro Binary files in test suite

## What changes were proposed in this pull request?

In PR https://github.com/apache/spark/pull/21984 and https://github.com/apache/spark/pull/21935, the related test cases use binary files created by Python scripts. This PR generates the binary files in the test suite instead, to make them more transparent. We can also move the related test cases to a new file, `AvroLogicalTypeSuite.scala`.

## How was this patch tested?

Unit test.

Closes #22091 from gengliangwang/logicalType_suite.

Authored-by: Gengliang Wang Signed-off-by: Wenchen Fan

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/26775e3c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/26775e3c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/26775e3c Branch: refs/heads/master Commit: 26775e3c8ed5bf9028253280b57da64678363f8a Parents: ab06c25 Author: Gengliang Wang Authored: Mon Aug 13 20:50:28 2018 +0800 Committer: Wenchen Fan Committed: Mon Aug 13 20:50:28 2018 +0800 -- external/avro/src/test/resources/date.avro | Bin 209 -> 0 bytes external/avro/src/test/resources/timestamp.avro | Bin 375 -> 0 bytes .../spark/sql/avro/AvroLogicalTypeSuite.scala | 298 +++ .../org/apache/spark/sql/avro/AvroSuite.scala | 242 +-- 4 files changed, 299 insertions(+), 241 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/26775e3c/external/avro/src/test/resources/date.avro -- diff --git a/external/avro/src/test/resources/date.avro b/external/avro/src/test/resources/date.avro deleted file mode 100644 index 3a67617..000 Binary files a/external/avro/src/test/resources/date.avro and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/26775e3c/external/avro/src/test/resources/timestamp.avro -- diff --git a/external/avro/src/test/resources/timestamp.avro b/external/avro/src/test/resources/timestamp.avro deleted file mode 100644 index daef50b..000 Binary files a/external/avro/src/test/resources/timestamp.avro and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/26775e3c/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeSuite.scala -- diff --git a/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeSuite.scala b/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeSuite.scala new file mode 100644 index 000..24d8c53 --- /dev/null +++ b/external/avro/src/test/scala/org/apache/spark/sql/avro/AvroLogicalTypeSuite.scala @@ -0,0 +1,298 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ +package org.apache.spark.sql.avro + +import java.io.File +import java.sql.Timestamp + +import org.apache.avro.{LogicalTypes, Schema} +import org.apache.avro.Conversions.DecimalConversion +import org.apache.avro.file.DataFileWriter +import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord} + +import org.apache.spark.SparkException +import org.apache.spark.sql.{QueryTest, Row} +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.test.{SharedSQLContext, SQLTestUtils} +import org.apache.spark.sql.types.{StructField, StructType, TimestampType} + +class AvroLogicalTypeSuite extends QueryTest with SharedSQLContext with SQLTestUtils { + import testImplicits._ + + val dateSchema = s""" + { +"namespace": "logical", +"type": "record", +"name": "test", +"fields": [ + {"name": "date", "type": {"type": "int", "logicalType": "date"}} +] + } +""" + + val dateInputData = Seq(7, 365, 0) + + def dateFile(path: String): String = { +val schema = new Schema.Parser().pars
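The suite's generator methods are truncated above; the following is a hedged reconstruction of the approach (standard Avro Java API calls; the helper name `writeDateFile` is illustrative), which writes a small Avro file with a `date` logical type instead of checking in an opaque binary fixture:

```scala
import java.io.File

import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Writes a tiny Avro file whose single field uses the `date` logical type,
// mirroring the schema and input data (7, 365, 0) shown in the suite above.
def writeDateFile(path: String): Unit = {
  val schemaJson =
    """{"namespace": "logical", "type": "record", "name": "test",
      |  "fields": [{"name": "date", "type": {"type": "int", "logicalType": "date"}}]}
      |""".stripMargin
  val schema = new Schema.Parser().parse(schemaJson)
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, new File(path))
  Seq(7, 365, 0).foreach { daysSinceEpoch =>
    val record = new GenericData.Record(schema)
    record.put("date", daysSinceEpoch) // days since the Unix epoch
    writer.append(record)
  }
  writer.close()
}
```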
spark git commit: [SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spilled during iteration
Repository: spark Updated Branches: refs/heads/master 26775e3c8 -> 2e3abdff2

[SPARK-22713][CORE] ExternalAppendOnlyMap leaks when spilled during iteration

## What changes were proposed in this pull request?

This PR solves [SPARK-22713](https://issues.apache.org/jira/browse/SPARK-22713), which describes a memory leak that occurs when an ExternalAppendOnlyMap is spilled during iteration (as opposed to during insertion). ExternalAppendOnlyMap's iterator supports spilling, but it kept a reference to the internal map (via an internal iterator) after spilling. The original code was apparently supposed to drop this reference on the next iteration, but according to the elaborate investigation described in the JIRA, this never happened. The fix is simply to replace the internal iterator immediately after spilling.

## How was this patch tested?

I've introduced a new test in ExternalAppendOnlyMapSuite; this test asserts that neither the external map itself nor its iterator hold any reference to the internal map after a spill. This approach required relaxing the access of some member variables and nested classes of ExternalAppendOnlyMap; these members are now package-private and annotated with VisibleForTesting.

Closes #21369 from eyalfa/SPARK-22713__ExternalAppendOnlyMap_effective_spill.

Authored-by: Eyal Farago Signed-off-by: Wenchen Fan

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2e3abdff Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2e3abdff Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2e3abdff Branch: refs/heads/master Commit: 2e3abdff23a0725b80992cc30dba2ecf9c2e7fd3 Parents: 26775e3 Author: Eyal Farago Authored: Mon Aug 13 20:55:46 2018 +0800 Committer: Wenchen Fan Committed: Mon Aug 13 20:55:46 2018 +0800 -- .../util/collection/ExternalAppendOnlyMap.scala | 35 +++--- .../collection/ExternalAppendOnlyMapSuite.scala | 119 ++- 2 files changed, 138 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2e3abdff/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala -- diff --git a/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala b/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala index d83da0d..19ff109 100644 --- a/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala +++ b/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala @@ -80,7 +80,10 @@ class ExternalAppendOnlyMap[K, V, C]( this(createCombiner, mergeValue, mergeCombiners, serializer, blockManager, TaskContext.get()) } - @volatile private var currentMap = new SizeTrackingAppendOnlyMap[K, C] + /** + * Exposed for testing + */ + @volatile private[collection] var currentMap = new SizeTrackingAppendOnlyMap[K, C] private val spilledMaps = new ArrayBuffer[DiskMapIterator] private val sparkConf = SparkEnv.get.conf private val diskBlockManager = blockManager.diskBlockManager @@ -267,7 +270,7 @@ class ExternalAppendOnlyMap[K, V, C]( */ def destructiveIterator(inMemoryIterator: Iterator[(K, C)]): Iterator[(K, C)] = { readingIterator = new SpillableIterator(inMemoryIterator) -readingIterator +readingIterator.toCompletionIterator } /** @@ -280,8 +283,7 @@ class ExternalAppendOnlyMap[K, V, C]( "ExternalAppendOnlyMap.iterator is destructive and should only be called once.") } if (spilledMaps.isEmpty) { -
CompletionIterator[(K, C), Iterator[(K, C)]]( -destructiveIterator(currentMap.iterator), freeCurrentMap()) + destructiveIterator(currentMap.iterator) } else { new ExternalIterator() } @@ -305,8 +307,8 @@ class ExternalAppendOnlyMap[K, V, C]( // Input streams are derived both from the in-memory map and spilled maps on disk // The in-memory map is sorted in place, while the spilled maps are already in sorted order -private val sortedMap = CompletionIterator[(K, C), Iterator[(K, C)]](destructiveIterator( - currentMap.destructiveSortedIterator(keyComparator)), freeCurrentMap()) +private val sortedMap = destructiveIterator( + currentMap.destructiveSortedIterator(keyComparator)) private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered) inputStreams.foreach { it => @@ -568,13 +570,11 @@ class ExternalAppendOnlyMap[K, V, C]( context.addTaskCompletionListener[Unit](context => cleanup()) } - private[this] class SpillableIterator(var upstream: Iterator[(K, C)]) + private class SpillableIterator(var upstream:
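A much-simplified, hedged sketch of the pattern the fix relies on (this is not the real `ExternalAppendOnlyMap` code; names loosely follow the diff): once the in-memory contents are spilled, `upstream` is immediately repointed at the on-disk iterator, so the last reference to the in-memory map is dropped and it becomes garbage-collectable mid-iteration.

```scala
class SpillableIteratorSketch[A](initial: Iterator[A]) extends Iterator[A] {
  @volatile private var upstream: Iterator[A] = initial
  private var hasSpilled = false

  // `spillTo` stands in for writing the remaining elements to disk and
  // returning a reader over the spilled file.
  def spill(spillTo: Iterator[A] => Iterator[A]): Unit = synchronized {
    if (!hasSpilled) {
      upstream = spillTo(upstream) // replace the reference right away
      hasSpilled = true
    }
  }

  override def hasNext: Boolean = synchronized(upstream.hasNext)
  override def next(): A = synchronized(upstream.next())
}
```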
spark git commit: [SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments, and add argument type check.
Repository: spark Updated Branches: refs/heads/master 2e3abdff2 -> b804ca577

[SPARK-23908][SQL][FOLLOW-UP] Rename inputs to arguments, and add argument type check.

## What changes were proposed in this pull request?

This is a follow-up PR of #21954 to address review comments.

- Rename the ambiguous name `inputs` to `arguments`.
- Add an argument type check and remove the hacky workaround.
- Address other small comments.

## How was this patch tested?

Existing tests and some additional tests.

Closes #22075 from ueshin/issues/SPARK-23908/fup1.

Authored-by: Takuya UESHIN Signed-off-by: Wenchen Fan

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b804ca57 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b804ca57 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b804ca57 Branch: refs/heads/master Commit: b804ca57718ad1568458d8185c8c30118be8275f Parents: 2e3abdf Author: Takuya UESHIN Authored: Mon Aug 13 20:58:29 2018 +0800 Committer: Wenchen Fan Committed: Mon Aug 13 20:58:29 2018 +0800 -- .../sql/catalyst/analysis/CheckAnalysis.scala | 14 ++ .../analysis/higherOrderFunctions.scala | 12 +- .../expressions/ExpectsInputTypes.scala | 16 +- .../expressions/higherOrderFunctions.scala | 181 ++- .../spark/sql/catalyst/plans/PlanTest.scala | 2 +- .../spark/sql/DataFrameFunctionsSuite.scala | 25 +++ 6 files changed, 152 insertions(+), 98 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b804ca57/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 4addc83..6a91d55 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -90,6 +90,20 @@ trait CheckAnalysis extends PredicateHelper { u.failAnalysis(s"Table or view not found: ${u.tableIdentifier}") case operator: LogicalPlan => +// Check argument data types of higher-order functions downwards first. +// If the arguments of the higher-order functions are resolved but the type check fails, +// the argument functions will not get resolved, but we should report the argument type +// check failure instead of claiming the argument functions are unresolved.
+operator transformExpressionsDown { + case hof: HigherOrderFunction + if hof.argumentsResolved && hof.checkArgumentDataTypes().isFailure => +hof.checkArgumentDataTypes() match { + case TypeCheckResult.TypeCheckFailure(message) => +hof.failAnalysis( + s"cannot resolve '${hof.sql}' due to argument data type mismatch: $message") +} +} + operator transformExpressionsUp { case a: Attribute if !a.resolved => val from = operator.inputSet.map(_.qualifiedName).mkString(", ") http://git-wip-us.apache.org/repos/asf/spark/blob/b804ca57/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala index 5e2029c..dd08190 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/higherOrderFunctions.scala @@ -95,15 +95,15 @@ case class ResolveLambdaVariables(conf: SQLConf) extends Rule[LogicalPlan] { */ private def createLambda( e: Expression, - partialArguments: Seq[(DataType, Boolean)]): LambdaFunction = e match { + argInfo: Seq[(DataType, Boolean)]): LambdaFunction = e match { case f: LambdaFunction if f.bound => f case LambdaFunction(function, names, _) => - if (names.size != partialArguments.size) { + if (names.size != argInfo.size) { e.failAnalysis( s"The number of lambda function arguments '${names.size}' does not " + "match the number of arguments expected by the higher order function " + -s"'${partialArguments.size}'.") +s"'${argInfo.size}'.") } if (names.map(a => canonicalizer(a.name)).distinct.size < names.size) { @@ -111,7 +111,7 @@ case class ResolveLambdaVariable
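A hedged illustration of the behavior this downward check targets (the exact error text below is an assumption): passing a non-array first argument to a higher-order function such as `transform` should now surface an argument data type mismatch rather than a confusing unresolved-lambda error.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("hof-arg-check").getOrCreate()

// `transform` expects an array as its first argument; an integer should fail
// the argument type check at analysis time.
try {
  spark.sql("SELECT transform(1, x -> x + 1)").show()
} catch {
  case e: AnalysisException => println(e.getMessage) // "... due to argument data type mismatch ..."
}
```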
spark-website git commit: Add CVE-2018-11770
Repository: spark-website Updated Branches: refs/heads/asf-site a63b5f427 -> e33a4bb7d Add CVE-2018-11770 Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/e33a4bb7 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/e33a4bb7 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/e33a4bb7 Branch: refs/heads/asf-site Commit: e33a4bb7d8bbc25bb6a7d96c8bd6c13e3b05e77b Parents: a63b5f4 Author: Sean Owen Authored: Mon Aug 13 09:25:05 2018 -0500 Committer: Sean Owen Committed: Mon Aug 13 09:25:05 2018 -0500 -- security.md| 62 +-- site/security.html | 99 +++-- 2 files changed, 138 insertions(+), 23 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/e33a4bb7/security.md -- diff --git a/security.md b/security.md index f99b9bd..19231f6 100644 --- a/security.md +++ b/security.md @@ -10,15 +10,55 @@ navigation: Reporting Security Issues Apache Spark uses the standard process outlined by the [Apache Security Team](https://www.apache.org/security/) -for reporting vulnerabilities. +for reporting vulnerabilities. Note that vulnerabilities should not be publicly disclosed until the project has +responded. To report a possible security vulnerability, please email `secur...@apache.org`. This is a non-public list that will reach the Apache Security team, as well as the Spark PMC. Known Security Issues +CVE-2018-11770: Apache Spark standalone master, Mesos REST APIs not controlled by authentication + +Severity: Medium + +Vendor: The Apache Software Foundation + +Versions Affected: + +- Spark versions from 1.3.0, running standalone master with REST API enabled, or running Mesos master with cluster mode enabled + +Description: + +From version 1.3.0 onward, Spark's standalone master exposes a REST API for job submission, in addition +to the submission mechanism used by `spark-submit`. In standalone, the config property +`spark.authenticate.secret` establishes a shared secret for authenticating requests to submit jobs via +`spark-submit`. However, the REST API does not use this or any other authentication mechanism, and this is +not adequately documented. In this case, a user would be able to run a driver program without authenticating, +but not launch executors, using the REST API. This REST API is also used by Mesos, when set up to run in +cluster mode (i.e., when also running `MesosClusterDispatcher`), for job submission. Future versions of Spark +will improve documentation on these points, and prohibit setting `spark.authenticate.secret` when running +the REST APIs, to make this clear. Future versions will also disable the REST API by default in the +standalone master by changing the default value of `spark.master.rest.enabled` to `false`. + +Mitigation: + +For standalone masters, disable the REST API by setting `spark.master.rest.enabled` to `false` if it is unused, +and/or ensure that all network access to the REST API (port 6066 by default) is restricted to hosts that are +trusted to submit jobs. Mesos users can stop the `MesosClusterDispatcher`, though that will prevent them +from running jobs in cluster mode. Alternatively, they can ensure access to the `MesosRestSubmissionServer` +(port 7077 by default) is restricted to trusted hosts. 
+ +Credit: + +- Imran Rashid, Cloudera +- Fengwei Zhang, Alibaba Cloud Security Team + + CVE-2018-8024: Apache Spark XSS vulnerability in UI +Severity: Medium + Versions Affected: - Spark versions through 2.1.2 @@ -26,6 +66,7 @@ Versions Affected: - Spark 2.3.0 Description: + In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, it's possible for a malicious user to construct a URL pointing to a Spark cluster's UI's job and stage info pages, and if a user can be tricked into accessing the URL, can be used to cause script to execute and expose information from @@ -55,6 +96,7 @@ Versions affected: - Spark 2.3.0 Description: + In Apache Spark up to and including 2.1.2, 2.2.0 to 2.2.1, and 2.3.0, when using PySpark or SparkR, it's possible for a different local user to connect to the Spark application and impersonate the user running the Spark application. @@ -79,9 +121,11 @@ Severity: Medium Vendor: The Apache Software Foundation Versions Affected: -Versions of Apache Spark from 1.6.0 until 2.1.1 + +- Versions of Apache Spark from 1.6.0 until 2.1.1 Description: + In Apache Spark 1.6.0 until 2.1.1, the launcher API performs unsafe deserialization of data received by its socket. This makes applications launched programmatically using the launcher API potentially @@ -92,6 +13
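As a concrete example of the standalone-master mitigation described above (the property key comes from the advisory text; the file location is the usual convention):

```
# conf/spark-defaults.conf on the standalone master
# Disable the unauthenticated REST submission API (port 6066 by default)
spark.master.rest.enabled  false
```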
svn commit: r28694 - in /dev/spark/2.4.0-SNAPSHOT-2018_08_13_08_02-b804ca5-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Mon Aug 13 15:16:09 2018 New Revision: 28694 Log: Apache Spark 2.4.0-SNAPSHOT-2018_08_13_08_02-b804ca5 docs [This commit notification would consist of 1476 parts, which exceeds the 50-part limit, so it was shortened to this summary.]
spark git commit: [SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values
Repository: spark Updated Branches: refs/heads/master b804ca577 -> c220cc42a

[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values

## What changes were proposed in this pull request?

`ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with an NPE if a partition column contains a NULL value. The PR avoids the NPE by replacing the `NULL` values with the default partition placeholder.

## How was this patch tested?

Added a UT.

Closes #22036 from mgaido91/SPARK-25028.

Authored-by: Marco Gaido Signed-off-by: Wenchen Fan

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c220cc42 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c220cc42 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c220cc42 Branch: refs/heads/master Commit: c220cc42abebbc98a6110b50f787eb6d338c2d97 Parents: b804ca5 Author: Marco Gaido Authored: Tue Aug 14 00:59:18 2018 +0800 Committer: Wenchen Fan Committed: Tue Aug 14 00:59:18 2018 +0800 -- .../command/AnalyzePartitionCommand.scala | 10 -- .../spark/sql/StatisticsCollectionSuite.scala | 18 ++ 2 files changed, 26 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c220cc42/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala index 5b54b22..18fefa0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.command import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession} import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, UnresolvedAttribute} -import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, ExternalCatalogUtils} import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Literal} import org.apache.spark.sql.execution.datasources.PartitioningUtils @@ -140,7 +140,13 @@ case class AnalyzePartitionCommand( val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count() df.collect().map { r => - val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) + val partitionColumnValues = partitionColumns.indices.map { i => +if (r.isNullAt(i)) { + ExternalCatalogUtils.DEFAULT_PARTITION_NAME +} else { + r.get(i).toString +} + } val spec = tableMeta.partitionColumnNames.zip(partitionColumnValues).toMap val count = BigInt(r.getLong(partitionColumns.size)) (spec, count) http://git-wip-us.apache.org/repos/asf/spark/blob/c220cc42/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala index 60fa951..cb562d6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala @@ -204,6 +204,24 @@ class StatisticsCollectionSuite
extends StatisticsCollectionTestBase with Shared } } + test("SPARK-25028: column stats collection for null partitioning columns") { +val table = "analyze_partition_with_null" +withTempDir { dir => + withTable(table) { +sql(s""" + |CREATE TABLE $table (value string, name string) + |USING PARQUET + |PARTITIONED BY (name) + |LOCATION '${dir.toURI}'""".stripMargin) +val df = Seq(("a", null), ("b", null)).toDF("value", "name") +df.write.mode("overwrite").insertInto(table) +sql(s"ANALYZE TABLE $table PARTITION (name) COMPUTE STATISTICS") +val partitions = spark.sessionState.catalog.listPartitions(TableIdentifier(table)) +assert(partitions.head.stats.get.rowCount.get == 2) + } +} + } + test("number format in statistics") { val numbers = Seq( BigInt(0) -> (("0.0 B", "0")),
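A hedged spark-shell repro of the scenario, closely following the new test above (the table name `analyze_npe` is illustrative; assumes `spark` and its implicits are in scope). Before this fix, the final `ANALYZE` statement threw a `NullPointerException`; afterwards the null partition value maps to the default partition placeholder and statistics are computed.

```scala
spark.sql(
  """CREATE TABLE analyze_npe (value STRING, name STRING)
    |USING PARQUET
    |PARTITIONED BY (name)""".stripMargin)
Seq(("a", null: String), ("b", null: String))
  .toDF("value", "name")
  .write.mode("overwrite").insertInto("analyze_npe")
spark.sql("ANALYZE TABLE analyze_npe PARTITION (name) COMPUTE STATISTICS")
```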
spark git commit: [SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values
Repository: spark Updated Branches: refs/heads/branch-2.3 b9b35b959 -> 787790b3c

[SPARK-25028][SQL] Avoid NPE when analyzing partition with NULL values

## What changes were proposed in this pull request?

`ANALYZE TABLE ... PARTITION(...) COMPUTE STATISTICS` can fail with an NPE if a partition column contains a NULL value. The PR avoids the NPE by replacing the `NULL` values with the default partition placeholder.

## How was this patch tested?

Added a UT.

Closes #22036 from mgaido91/SPARK-25028.

Authored-by: Marco Gaido Signed-off-by: Wenchen Fan (cherry picked from commit c220cc42abebbc98a6110b50f787eb6d338c2d97) Signed-off-by: Wenchen Fan

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/787790b3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/787790b3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/787790b3 Branch: refs/heads/branch-2.3 Commit: 787790b3c733085b8b5e95cf832dedd481ab3b9a Parents: b9b35b9 Author: Marco Gaido Authored: Tue Aug 14 00:59:18 2018 +0800 Committer: Wenchen Fan Committed: Tue Aug 14 00:59:54 2018 +0800 -- .../command/AnalyzePartitionCommand.scala | 10 -- .../spark/sql/StatisticsCollectionSuite.scala | 18 ++ 2 files changed, 26 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/787790b3/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala index 5b54b22..18fefa0 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzePartitionCommand.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.command import org.apache.spark.sql.{AnalysisException, Column, Row, SparkSession} import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.{NoSuchPartitionException, UnresolvedAttribute} -import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType} +import org.apache.spark.sql.catalyst.catalog.{CatalogTable, CatalogTableType, ExternalCatalogUtils} import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, Literal} import org.apache.spark.sql.execution.datasources.PartitioningUtils @@ -140,7 +140,13 @@ case class AnalyzePartitionCommand( val df = tableDf.filter(Column(filter)).groupBy(partitionColumns: _*).count() df.collect().map { r => - val partitionColumnValues = partitionColumns.indices.map(r.get(_).toString) + val partitionColumnValues = partitionColumns.indices.map { i => +if (r.isNullAt(i)) { + ExternalCatalogUtils.DEFAULT_PARTITION_NAME +} else { + r.get(i).toString +} + } val spec = tableMeta.partitionColumnNames.zip(partitionColumnValues).toMap val count = BigInt(r.getLong(partitionColumns.size)) (spec, count) http://git-wip-us.apache.org/repos/asf/spark/blob/787790b3/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala index b11e798..0e7209a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala +++ 
b/sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala @@ -198,6 +198,24 @@ class StatisticsCollectionSuite extends StatisticsCollectionTestBase with Shared } } + test("SPARK-25028: column stats collection for null partitioning columns") { +val table = "analyze_partition_with_null" +withTempDir { dir => + withTable(table) { +sql(s""" + |CREATE TABLE $table (value string, name string) + |USING PARQUET + |PARTITIONED BY (name) + |LOCATION '${dir.toURI}'""".stripMargin) +val df = Seq(("a", null), ("b", null)).toDF("value", "name") +df.write.mode("overwrite").insertInto(table) +sql(s"ANALYZE TABLE $table PARTITION (name) COMPUTE STATISTICS") +val partitions = spark.sessionState.catalog.listPartitions(TableIdentifier(table)) +assert(partitions.head.stats.get.rowCount.get == 2) + } +} + } + test("number format in statistics") { val numbers = Seq( Big
svn commit: r28695 - in /dev/spark/2.3.3-SNAPSHOT-2018_08_13_10_01-787790b-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Mon Aug 13 17:15:24 2018 New Revision: 28695 Log: Apache Spark 2.3.3-SNAPSHOT-2018_08_13_10_01-787790b docs [This commit notification would consist of 1443 parts, which exceeds the 50-part limit, so it was shortened to this summary.]
svn commit: r28697 - in /dev/spark/2.4.0-SNAPSHOT-2018_08_13_12_01-c220cc4-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Mon Aug 13 19:16:02 2018 New Revision: 28697 Log: Apache Spark 2.4.0-SNAPSHOT-2018_08_13_12_01-c220cc4 docs [This commit notification would consist of 1476 parts, which exceeds the 50-part limit, so it was shortened to this summary.]
spark-website git commit: Stash pride logo for next year
Repository: spark-website Updated Branches: refs/heads/asf-site e33a4bb7d -> 8eb764260 Stash pride logo for next year Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/8eb76426 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/8eb76426 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/8eb76426 Branch: refs/heads/asf-site Commit: 8eb764260f5308960c69c212c642cd19ededf3ed Parents: e33a4bb Author: Sean Owen Authored: Sat Aug 11 21:35:01 2018 -0500 Committer: Sean Owen Committed: Mon Aug 13 20:12:03 2018 -0500 -- images/spark-logo-trademark.png | Bin 49720 -> 26999 bytes images/spark-logo.png| Bin 49720 -> 26999 bytes site/images/spark-logo-trademark.png | Bin 49720 -> 26999 bytes site/images/spark-logo.png | Bin 49720 -> 26999 bytes 4 files changed, 0 insertions(+), 0 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/images/spark-logo-trademark.png -- diff --git a/images/spark-logo-trademark.png b/images/spark-logo-trademark.png index eab639f..16702a9 100644 Binary files a/images/spark-logo-trademark.png and b/images/spark-logo-trademark.png differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/images/spark-logo.png -- diff --git a/images/spark-logo.png b/images/spark-logo.png index eab639f..16702a9 100644 Binary files a/images/spark-logo.png and b/images/spark-logo.png differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/site/images/spark-logo-trademark.png -- diff --git a/site/images/spark-logo-trademark.png b/site/images/spark-logo-trademark.png index eab639f..16702a9 100644 Binary files a/site/images/spark-logo-trademark.png and b/site/images/spark-logo-trademark.png differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/8eb76426/site/images/spark-logo.png -- diff --git a/site/images/spark-logo.png b/site/images/spark-logo.png index eab639f..16702a9 100644 Binary files a/site/images/spark-logo.png and b/site/images/spark-logo.png differ
[1/2] spark git commit: Preparing Spark release v2.3.2-rc5
Repository: spark Updated Branches: refs/heads/branch-2.3 787790b3c -> 29a040361 Preparing Spark release v2.3.2-rc5 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4dc82259 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4dc82259 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4dc82259 Branch: refs/heads/branch-2.3 Commit: 4dc82259d81102e0cb48f4cb2e8075f80d899ac4 Parents: 787790b Author: Saisai Shao Authored: Tue Aug 14 02:55:09 2018 + Committer: Saisai Shao Committed: Tue Aug 14 02:55:09 2018 + -- R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 41 files changed, 42 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4dc82259/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 6ec4966..8df2635 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,6 +1,6 @@ Package: SparkR Type: Package -Version: 2.3.3 +Version: 2.3.2 Title: R Frontend for Apache Spark Description: Provides an R Frontend for Apache Spark. 
Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"), http://git-wip-us.apache.org/repos/asf/spark/blob/4dc82259/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index f8b15cc..57485fc 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.2 ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/4dc82259/common/kvstore/pom.xml -- diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index e412a47..53e58c2 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.2 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/4dc82259/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index d8f9a3d..d05647c 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.3-SNAPSHOT +2.3.2 ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/4dc82259/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index a1a4f87..8d46761 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.3.2-rc5 [created] 4dc82259d
[2/2] spark git commit: Preparing development version 2.3.3-SNAPSHOT
Preparing development version 2.3.3-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/29a04036 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/29a04036 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/29a04036 Branch: refs/heads/branch-2.3 Commit: 29a040361c4de5c6438c909ded9959ccd53e1a7c Parents: 4dc8225 Author: Saisai Shao Authored: Tue Aug 14 02:55:19 2018 + Committer: Saisai Shao Committed: Tue Aug 14 02:55:19 2018 + -- R/pkg/DESCRIPTION | 2 +- assembly/pom.xml | 2 +- common/kvstore/pom.xml| 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- hadoop-cloud/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- python/pyspark/version.py | 2 +- repl/pom.xml | 2 +- resource-managers/kubernetes/core/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 41 files changed, 42 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/29a04036/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 8df2635..6ec4966 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,6 +1,6 @@ Package: SparkR Type: Package -Version: 2.3.2 +Version: 2.3.3 Title: R Frontend for Apache Spark Description: Provides an R Frontend for Apache Spark. 
Authors@R: c(person("Shivaram", "Venkataraman", role = c("aut", "cre"), http://git-wip-us.apache.org/repos/asf/spark/blob/29a04036/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 57485fc..f8b15cc 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.3.2 +2.3.3-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/29a04036/common/kvstore/pom.xml -- diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 53e58c2..e412a47 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.2 +2.3.3-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/29a04036/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index d05647c..d8f9a3d 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3.2 +2.3.3-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/29a04036/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 8d46761..a1a4f87 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.3
svn commit: r28702 - /dev/spark/v2.3.2-rc5-bin/
Author: jshao Date: Tue Aug 14 04:02:50 2018 New Revision: 28702 Log: Apache Spark v2.3.2-rc5 Added: dev/spark/v2.3.2-rc5-bin/ dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz (with props) dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.asc dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.sha512 dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz (with props) dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.asc dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.sha512 dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-hadoop2.6.tgz (with props) dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-hadoop2.6.tgz.asc dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-hadoop2.6.tgz.sha512 dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-hadoop2.7.tgz (with props) dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-hadoop2.7.tgz.asc dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-hadoop2.7.tgz.sha512 dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-without-hadoop.tgz (with props) dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-without-hadoop.tgz.asc dev/spark/v2.3.2-rc5-bin/spark-2.3.2-bin-without-hadoop.tgz.sha512 dev/spark/v2.3.2-rc5-bin/spark-2.3.2.tgz (with props) dev/spark/v2.3.2-rc5-bin/spark-2.3.2.tgz.asc dev/spark/v2.3.2-rc5-bin/spark-2.3.2.tgz.sha512 Added: dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz == Binary file - no diff available. Propchange: dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.asc == --- dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.asc (added) +++ dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.asc Tue Aug 14 04:02:50 2018 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIcBAABCgAGBQJbcktnAAoJENsLIaASlz/Qx1AQALpg9+8iDcJ/rW+q4GxLAsBB +76So/oYAWQSRpj4AeBDnJbfiyVjFsny1x26+IyKLyz90A5G3astBx1j92LpVWqag +ii4C3u9HyHYmfSriWlAxeJYhDt7MhdsM+Es31Q+uO+3QPB2Up+DuGYA9PzrE/jSA +QY5NQ+jVGH83KIynMQXHVTbz1MMYQrtwIVOImrBDrf+vgTTm3Whz5xYxMQpVcNDY +C+VQigGKoqq0rxjJd1lqer3F5KjCqSoHk7xIBBh7C/Kjk3Wv1x6y3O88r3v1WWPe +Nww/UXhFDD9QKY+8T9TvhW/OEA6dgHm87zko3AXOMaPIHdoyU57L/5uUdICt72iW +YT7YMdecZgzd7QCU6rneEwZgU6WS1TvdcvAGi8JvAszGNuQeYqKw7c+EkzMiv7Ys +h3Ymcwq5ODULtQh8UQbiECcpeECmp4h1Vnq9FQDUco3XYEkGesuAUET9wMjCWeqN +ahR08j/cbcW7yxbOwKpsl4RuSyAqQwQIRkM9GK8g+z091V2MJFfq241Wip2eHZK5 +pakWR8XFemVCqFUppIzrIbIAve5Hk0YRZL/l6bGcZSfKu3aCr3ndges5SJfufuYV +EKlQjyGnz8o6QsZ+qMi/LRZl5Wxh9eHamn/Eg96H36jYc8I1V5xf1ZGOdVlngK5K +Dub/tLAYfVPJJfSziOBK +=6LOG +-END PGP SIGNATURE- Added: dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.sha512 == --- dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.sha512 (added) +++ dev/spark/v2.3.2-rc5-bin/SparkR_2.3.2.tar.gz.sha512 Tue Aug 14 04:02:50 2018 @@ -0,0 +1,3 @@ +SparkR_2.3.2.tar.gz: 5C580581 A27AAEC3 0F1176AD EF0817E7 8D58CE8A 1BAEE405 + 2CE70766 D3BCCE9B D8531F79 CFAB75E9 59ACF879 A1BAB6A8 + 2E7EA2AD 37D6742D F57EC3E9 42D964B3 Added: dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz == Binary file - no diff available. 
Propchange: dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz -- svn:mime-type = application/octet-stream Added: dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.asc == --- dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.asc (added) +++ dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.asc Tue Aug 14 04:02:50 2018 @@ -0,0 +1,16 @@ +-BEGIN PGP SIGNATURE- + +iQIcBAABCgAGBQJbck+7AAoJENsLIaASlz/Q1KMP/jvhiZw5ImbWJpwOFjkO73G+ +dF3Oo20IZmhiRN16NRJaYO2SzdH3t8HDnEMBC5cbZIR69h+qres5aGN9K1v/DbmH +88BLUDiSnk7+XXqX+jfQGwowyqE65kj20H6QCGWBsD56m+gbzadtgJ/GMiG9lvKz +yyERagY0/shKABbXTNyiAtmKI12FR4/L/Y98WDlSs90LEYMHFxDAummHWMqPdyn4 +vF2pMV/7mvthWr7HNyt6cXtBG6KUTszt674VMAeJn5Yt3ZkCpydJslkSsm1WLu9V +TZ7H5F6R6DlxfopExdu/lGZbINlFmSdKPhzKeX9j0yqzUOjY64obhEJGZNgOl5yU +/YC/D1u1NTafIb8g2tdzXsJQGI9v3+KmCqgBKBKAcNeEycRNIdvHvswPgz9g+Jzf +gpMpHLrZHbIv62RmlzERJvd5v+PfT7195ax85Gb+p7k2Zjea0J1oC5iEj+qhRvl/ +Y3kpWd/258s3bLhrv+MUYwzZepLBm3brY/Jbs9N6VEnbEhzQeOHHLj2loIHR1R/W +CKXHLzHjQCXWvcfBCpmdF9SUGI8ZUSNZrV/96D4T6pmAA1QU3e2RC8N83SOHeAlt +iEPF/lgeqp6zClV8mKs245cIZt7MRaovPghRWapSfp6XrwomreDDPUcrlJmgpV3h +e1ronCjB3AvaJ9LOh+IA +=mxwy +-END PGP SIGNATURE- Added: dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.sha512 == --- dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.sha512 (added) +++ dev/spark/v2.3.2-rc5-bin/pyspark-2.3.2.tar.gz.sha512 Tue Aug 14 04:02:50 2018 @@ -0,0 +1,3 @@ +
spark git commit: [SPARK-25104][SQL] Avro: Validate user specified output schema
Repository: spark Updated Branches: refs/heads/master c220cc42a -> ab197308a

[SPARK-25104][SQL] Avro: Validate user specified output schema

## What changes were proposed in this pull request?

With the code changes in https://github.com/apache/spark/pull/21847, Spark can write out to an Avro file as per a user-provided output schema. To make this more robust and user friendly, we should validate the Avro schema before tasks are launched. We should also support outputting the decimal logical type as BYTES (by default we output it as FIXED).

## How was this patch tested?

Unit test.

Closes #22094 from gengliangwang/AvroSerializerMatch.

Authored-by: Gengliang Wang Signed-off-by: DB Tsai

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ab197308 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ab197308 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ab197308 Branch: refs/heads/master Commit: ab197308a79c74f0a4205a8f60438811b5e0b991 Parents: c220cc4 Author: Gengliang Wang Authored: Tue Aug 14 04:43:14 2018 + Committer: DB Tsai Committed: Tue Aug 14 04:43:14 2018 + -- .../apache/spark/sql/avro/AvroSerializer.scala | 108 +++ .../spark/sql/avro/AvroLogicalTypeSuite.scala | 40 +++ .../org/apache/spark/sql/avro/AvroSuite.scala | 57 ++ 3 files changed, 158 insertions(+), 47 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ab197308/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala -- diff --git a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala index 3a9544c..f551c83 100644 --- a/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala +++ b/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala @@ -26,6 +26,7 @@ import org.apache.avro.Conversions.DecimalConversion import org.apache.avro.LogicalTypes.{TimestampMicros, TimestampMillis} import org.apache.avro.Schema import org.apache.avro.Schema.Type +import org.apache.avro.Schema.Type._ import org.apache.avro.generic.GenericData.{EnumSymbol, Fixed, Record} import org.apache.avro.generic.GenericData.Record import org.apache.avro.util.Utf8 @@ -72,62 +73,70 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable: private lazy val decimalConversions = new DecimalConversion() private def newConverter(catalystType: DataType, avroType: Schema): Converter = { -catalystType match { - case NullType => +(catalystType, avroType.getType) match { + case (NullType, NULL) => (getter, ordinal) => null - case BooleanType => + case (BooleanType, BOOLEAN) => (getter, ordinal) => getter.getBoolean(ordinal) - case ByteType => + case (ByteType, INT) => (getter, ordinal) => getter.getByte(ordinal).toInt - case ShortType => + case (ShortType, INT) => (getter, ordinal) => getter.getShort(ordinal).toInt - case IntegerType => + case (IntegerType, INT) => (getter, ordinal) => getter.getInt(ordinal) - case LongType => + case (LongType, LONG) => (getter, ordinal) => getter.getLong(ordinal) - case FloatType => + case (FloatType, FLOAT) => (getter, ordinal) => getter.getFloat(ordinal) - case DoubleType => + case (DoubleType, DOUBLE) => (getter, ordinal) => getter.getDouble(ordinal) - case d: DecimalType => + case (d: DecimalType, FIXED) +if avroType.getLogicalType == LogicalTypes.decimal(d.precision, d.scale) => (getter, ordinal) => val decimal = getter.getDecimal(ordinal, d.precision, d.scale)
decimalConversions.toFixed(decimal.toJavaBigDecimal, avroType, LogicalTypes.decimal(d.precision, d.scale)) - case StringType => avroType.getType match { -case Type.ENUM => - import scala.collection.JavaConverters._ - val enumSymbols: Set[String] = avroType.getEnumSymbols.asScala.toSet - (getter, ordinal) => -val data = getter.getUTF8String(ordinal).toString -if (!enumSymbols.contains(data)) { - throw new IncompatibleSchemaException( -"Cannot write \"" + data + "\" since it's not defined in enum \"" + - enumSymbols.mkString("\", \"") + "\"") -} -new EnumSymbol(avroType, data) -case _ => - (getter, ordinal) => new Utf8(getter.getUTF8String(ordinal).getBytes) - } - case BinaryType => avroType.getType match { -case Type.FIXED => - val size = avroType.getFixedSize(
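A hedged usage sketch of what the validation protects: supplying a user-defined output schema through the Avro data source's `avroSchema` option, here writing a decimal as the BYTES-backed decimal logical type as the PR describes. `df` is assumed to be a DataFrame with a single `DecimalType(10, 2)` column named `d`; with this patch, an incompatible user schema should be rejected before tasks are launched.

```scala
val avroSchema =
  """{"type": "record", "name": "rec", "fields": [
    |  {"name": "d", "type": {"type": "bytes", "logicalType": "decimal",
    |                         "precision": 10, "scale": 2}}]}""".stripMargin
df.write.format("avro").option("avroSchema", avroSchema).save("/tmp/avro_out")
```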
spark git commit: [SPARK-22974][ML] Attach attributes to output column of CountVectorModel
Repository: spark Updated Branches: refs/heads/master ab197308a -> 3eb52092b

[SPARK-22974][ML] Attach attributes to output column of CountVectorModel

## What changes were proposed in this pull request?

The output column from `CountVectorizerModel` lacks attributes, so a later transformer like `Interaction` can raise an error because no attributes are available.

## How was this patch tested?

Added a test.

Closes #20313 from viirya/SPARK-22974.

Authored-by: Liang-Chi Hsieh Signed-off-by: DB Tsai

Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3eb52092 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3eb52092 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3eb52092 Branch: refs/heads/master Commit: 3eb52092b3aa9d7d2fc1e50ac237d47bfb3b9e92 Parents: ab19730 Author: Liang-Chi Hsieh Authored: Tue Aug 14 05:05:16 2018 + Committer: DB Tsai Committed: Tue Aug 14 05:05:16 2018 + -- .../apache/spark/ml/feature/CountVectorizer.scala | 5 - .../spark/ml/feature/CountVectorizerSuite.scala | 16 2 files changed, 20 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3eb52092/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala index 10c48c3..dc8eb82 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala @@ -21,6 +21,7 @@ import org.apache.hadoop.fs.Path import org.apache.spark.annotation.Since import org.apache.spark.broadcast.Broadcast import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute} import org.apache.spark.ml.linalg.{Vectors, VectorUDT} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} @@ -317,7 +318,9 @@ class CountVectorizerModel( Vectors.sparse(dictBr.value.size, effectiveCounts) } -dataset.withColumn($(outputCol), vectorizer(col($(inputCol +val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]] +val metadata = new AttributeGroup($(outputCol), attrs).toMetadata() +dataset.withColumn($(outputCol), vectorizer(col($(inputCol))), metadata) } @Since("1.5.0") http://git-wip-us.apache.org/repos/asf/spark/blob/3eb52092/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala index 6121766..bca580d 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala @@ -289,4 +289,20 @@ class CountVectorizerSuite extends MLTest with DefaultReadWriteTest { val newInstance = testDefaultReadWrite(instance) assert(newInstance.vocabulary === instance.vocabulary) } + + test("SPARK-22974: CountVectorModel should attach proper attribute to output column") { +val df = spark.createDataFrame(Seq( + (0, 1.0, Array("a", "b", "c")), + (1, 2.0, Array("a", "b", "b", "c", "a", "d")) +)).toDF("id", "features1", "words") + +val cvm = new CountVectorizerModel(Array("a", "b", "c")) + .setInputCol("words") + 
.setOutputCol("features2") + +val df1 = cvm.transform(df) +val interaction = new Interaction().setInputCols(Array("features1", "features2")) + .setOutputCol("features") +interaction.transform(df1) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
svn commit: r28704 - in /dev/spark/2.3.3-SNAPSHOT-2018_08_13_22_02-29a0403-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _s
Author: pwendell Date: Tue Aug 14 05:16:02 2018 New Revision: 28704 Log: Apache Spark 2.3.3-SNAPSHOT-2018_08_13_22_02-29a0403 docs [This commit notification would consist of 1443 parts, which exceeds the 50-part limit, so it was shortened to this summary.]
svn commit: r28707 - in /dev/spark/v2.3.2-rc5-docs: ./ _site/ _site/api/ _site/api/R/ _site/api/java/ _site/api/java/lib/ _site/api/java/org/ _site/api/java/org/apache/ _site/api/java/org/apache/spark
Author: jshao Date: Tue Aug 14 06:54:52 2018 New Revision: 28707 Log: Apache Spark v2.3.2-rc5 docs [This commit notification would consist of 1446 parts, which exceeds the 50-part limit, so it was shortened to this summary.]