[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89081/
Test PASSed.


---




[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21009
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21009
  
**[Test build #89081 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89081/testReport)** for PR 21009 at commit [`79beb00`](https://github.com/apache/spark/commit/79beb00026fb2dd2df295bccd797aaf21746ebdb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class DayOfWeek(child: Expression) extends DayWeek`
   * `case class WeekDay(child: Expression) extends DayWeek`
   * `abstract class DayWeek extends UnaryExpression with ImplicitCastInputTypes`
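
For context, a hedged sketch of what the new expressions do once merged: `weekday` follows the 0 = Monday convention, while the existing `dayofweek` uses 1 = Sunday.

```scala
import org.apache.spark.sql.SparkSession

// Hedged illustration of the semantics added by SPARK-23905, not CI output:
val spark = SparkSession.builder()
  .master("local[*]").appName("weekday-demo").getOrCreate()

// 2018-04-09 was a Monday.
spark.sql("SELECT weekday('2018-04-09') AS wd, dayofweek('2018-04-09') AS dow").show()
// +---+---+
// | wd|dow|
// +---+---+
// |  0|  2|
// +---+---+
```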


---




[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21018
  
I'll fix the failure tonight (JST).


---




[GitHub] spark issue #21013: [WIP][SPARK-23874][SQL][PYTHON] Upgrade Arrow and pyarro...

2018-04-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21013
  
@BryanCutler Yeah, we are unable to upgrade to 0.9 until the regression is 
fixed. Thanks for working on it! 


---




[GitHub] spark issue #21013: [WIP][SPARK-23874][SQL][PYTHON] Upgrade Arrow and pyarro...

2018-04-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21013
  
cc @ueshin @yhuai 


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21020
  
**[Test build #89087 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89087/testReport)** for PR 21020 at commit [`a0bba18`](https://github.com/apache/spark/commit/a0bba18ef9a19774e1965867241f56fb0508c579).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21020
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21020
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89087/
Test PASSed.


---




[GitHub] spark issue #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIndex

2018-04-09 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/21004
  
@cloud-fan @gatorsmile 


---




[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21018
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21018
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89083/
Test FAILed.


---




[GitHub] spark pull request #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of ma...

2018-04-09 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21001#discussion_r180297255
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -1459,11 +1459,6 @@ class HiveDDLSuite
   sql(s"ALTER TABLE tbl UNSET TBLPROPERTIES 
('${forbiddenPrefix}foo')")
 }
 assert(e2.getMessage.contains(forbiddenPrefix + "foo"))
-
-val e3 = intercept[AnalysisException] {
-  sql(s"CREATE TABLE tbl (a INT) TBLPROPERTIES 
('${forbiddenPrefix}foo'='anything')")
-}
-assert(e3.getMessage.contains(forbiddenPrefix + "foo"))
--- End diff --

Instead of deleting the test, we should change the name of the table.


---




[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21018
  
**[Test build #89083 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89083/testReport)** for PR 21018 at commit [`01d75d7`](https://github.com/apache/spark/commit/01d75d789c45f73bd999106dfc6f29cdc3050ce9).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21020
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21020
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2123/
Test PASSed.


---




[GitHub] spark pull request #21005: [SPARK-23898][SQL] Simplify add & subtract code g...

2018-04-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21005


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21020
  
**[Test build #89087 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89087/testReport)** for PR 21020 at commit [`a0bba18`](https://github.com/apache/spark/commit/a0bba18ef9a19774e1965867241f56fb0508c579).


---




[GitHub] spark pull request #21020: [MINOR][DOCS] Fix R documentation generation inst...

2018-04-09 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/21020

[MINOR][DOCS] Fix R documentation generation instruction for roxygen2

## What changes were proposed in this pull request?

This PR proposes to fix `roxygen2` to `5.0.1` in `docs/README.md` for 
SparkR documentation generation.

If I use a higher version and create the docs, it produces the diff below. Not a big deal, but it bothered me.

```diff
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 855eb5bf77f..159fca61e06 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -57,6 +57,6 @@ Collate:
 'types.R'
 'utils.R'
 'window.R'
-RoxygenNote: 5.0.1
+RoxygenNote: 6.0.1
 VignetteBuilder: knitr
 NeedsCompilation: no
```

## How was this patch tested?

Manually tested. I hit this every time I set up a new environment for Spark dev, but I kept forgetting to fix it.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark minor-r-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21020.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21020


commit a0bba18ef9a19774e1965867241f56fb0508c579
Author: hyukjinkwon 
Date:   2018-04-10T04:44:33Z

Fix R documentation generation instruction for roxygen2




---




[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21018
  
cc @mengxr @cloud-fan 


---




[GitHub] spark issue #21020: [MINOR][DOCS] Fix R documentation generation instruction...

2018-04-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21020
  
cc @felixcheung could you take a look when you are available?


---




[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21005
  
Thanks! Merged to master.


---




[GitHub] spark issue #21008: [SPARK-23902][SQL] Add roundOff flag to months_between

2018-04-09 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21008
  
> HIVE-15978 introduced the roundOff flag in order to disable the rounding to 8 digits which is performed in months_between. Since this can be a computationally intensive operation, skipping it may improve performance when the rounding is not needed.

The correct JIRA number is HIVE-15511.
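
For context, a hedged sketch of how the proposed flag would be used from the DataFrame API, assuming it lands as a third `roundOff` argument on `months_between`:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, months_between}

val spark = SparkSession.builder()
  .master("local[*]").appName("months-between-demo").getOrCreate()
import spark.implicits._

val df = Seq(("2018-04-09", "2018-01-01")).toDF("end", "start")

// Default behavior: the result is rounded to 8 digits.
df.select(months_between(col("end"), col("start"))).show()

// Proposed: skip the (potentially costly) rounding when it is not needed.
df.select(months_between(col("end"), col("start"), roundOff = false)).show()
```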


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21017
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2122/
Test PASSed.


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21017
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21017
  
**[Test build #89086 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89086/testReport)** for PR 21017 at commit [`03ca4ec`](https://github.com/apache/spark/commit/03ca4ec3136253d7087bcaa67bdc1401a7de90fc).


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21017
  
Jenkins, retest this please.


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL][WIP] Provide an option to igno...

2018-04-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r180291983
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -887,6 +887,14 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val IGNORE_NULL_FIELDS_STREAMING_SCHEMA_INFERENCE =
+buildConf("spark.sql.streaming.schemaInference.ignoreNullFields")
--- End diff --

`ignoreNullFields` is not precise. For example:

```json
{"a": 1}
{"a": null}
```

The `null` value is not ignored. I would suggest `dropFieldIfAllNull`.
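
A small hedged illustration (Spark shell style) of why the value is kept:

```scala
// With these records, field "a" is not all-null, so it must stay in the
// inferred schema; "ignoreNullFields" would wrongly suggest the null value
// itself is skipped during inference.
import spark.implicits._
val records = Seq("""{"a": 1}""", """{"a": null}""").toDS()
spark.read.json(records).printSchema()
// root
//  |-- a: long (nullable = true)
```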


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL][WIP] Provide an option to igno...

2018-04-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r180293096
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -624,6 +624,42 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
 }
   }
 
+  test("SPARK-23772 Ignore column of all null values or empty array during 
JSON schema inference") {
--- End diff --

We need more test coverage. I have a similar internal implementation that 
tests the following cases (ignore the actual test, just look at the example 
records):

```scala
  test("null") {
assert(removeNullRecursively("null") === "null")
  }

  test("empty string") {
assert(removeNullRecursively("\"\"") === "\"\"")
  }

  test("empty object") {
assert(removeNullRecursively("{}") === "null")
  }

  test("object with all null values") {
val json = """{"a":null,"b":null, "c":null}"""
assert(removeNullRecursively(json) === "null")
  }

  test("object with some null fields") {
val json = """{"a":null,"b":"c","d":null,"e":"f"}"""
val expected = """{"b":"c","e":"f"}"""
assert(removeNullRecursively(json) === expected)
  }

  test("object with some nested null values") {
val json = 
"""{"a":{},"b":{"c":null},"d":{"c":"e"},"f":{"c":null,"g":"h"}}"""
val expected = """{"d":{"c":"e"},"f":{"g":"h"}}"""
assert(removeNullRecursively(json) === expected)
  }

  test("array with all null elements") {
val json = """[null,null,{},{"a":null}]"""
val expected = "null"
assert(removeNullRecursively(json) === expected)
  }

  test("array with some null elements") {
// TODO: is it an issue if we convert an empty object to null in an array?
val json = """[null,"a",null,{},"b"]"""
val expected = """[null,"a",null,null,"b"]"""
assert(removeNullRecursively(json) === expected)
  }
```


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL][WIP] Provide an option to igno...

2018-04-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r180291486
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/TypePlaceholder.scala ---
@@ -0,0 +1,23 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+/**
+ * An internal type that is not yet available and will be replaced by an actual type later.
+ */
+case object TypePlaceholder extends StringType
--- End diff --

Is it necessary to introduce a new DataType? Would it be the same if we used `NullType`? With the flag on, at the end of schema inference, `NullType`, `ArrayType(NullType)`, etc. should be dropped instead of falling back to `StringType`. Basically, during schema inference we keep the type that reveals more detail, for example:

```
(NullType, ArrayType(NullType)) => ArrayType(NullType)
(ArrayType(NullType), ArrayType(StructType(Field("a", NullType)))) =>
  ArrayType(StructType(Field("a", NullType)))
```

At the end, we implement a util method that determines whether a field is all null and drops it if so. It should be done recursively. I have an internal implementation with similar logic, but it operates on the JSON record itself. You might want to apply it to data types.

```scala
  // json4s imports needed to make this snippet self-contained:
  import org.json4s._
  import org.json4s.jackson.JsonMethods.{compact, parse, render}

  /**
   * Removes null fields recursively from the input JSON record.
   * An array is null if all its elements are null.
   * An object is null if all its values are null.
   */
  def removeNullRecursively(jsonStr: String): String = {
val json = parse(jsonStr)
val cleaned = doRemoveNullRecursively(json)
compact(render(cleaned)) // should handle null correctly
  }

  private def doRemoveNullRecursively(value: JValue): JValue = {
value match {
  case null =>
null

  case JNull =>
null

  case JArray(values) =>
val cleaned = values.map(doRemoveNullRecursively)
if (cleaned.exists(_ != null)) {
  JArray(cleaned)
} else {
  null
}

  case JObject(pairs) =>
val cleaned = pairs.flatMap { case (k, v) =>
  val cv = doRemoveNullRecursively(v)
  if (cv != null) {
Some((k, cv))
  } else {
None
  }
}
if (cleaned.nonEmpty) {
  JObject(cleaned)
} else {
  null
}

  // all other types are non-null
  case _ =>
value
}
  }
```


---




[GitHub] spark pull request #20929: [SPARK-23772][SQL][WIP] Provide an option to igno...

2018-04-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20929#discussion_r180291939
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -887,6 +887,14 @@ object SQLConf {
   .booleanConf
   .createWithDefault(false)
 
+  val IGNORE_NULL_FIELDS_STREAMING_SCHEMA_INFERENCE =
--- End diff --

* I was expecting a configuration on the JSON data source instead of a Spark SQL conf. For example, how would a data source that lives outside Spark implement this feature?
* Since Spark SQL cannot control the behavior of external data source implementations, the flag won't work unless external data sources recognize it. This is another reason to put it under the JSON data source.
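
For illustration, a data-source-level option might look like the sketch below; `dropFieldIfAllNull` is the name suggested earlier in this thread and is an assumption, not an existing option at this point:

```scala
// Hedged sketch: a JSON reader option instead of a global SQL conf, so an
// external data source could recognize and implement the same knob.
val df = spark.read
  .format("json")
  .option("dropFieldIfAllNull", "true") // assumed option name
  .load("/path/to/json")
```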


---




[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17092
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17092: [SPARK-18450][ML] Scala API Change for LSH AND-amplifica...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17092
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2121/
Test PASSed.


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21017
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21017
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89082/
Test FAILed.


---




[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21017
  
**[Test build #89082 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89082/testReport)** for PR 21017 at commit [`03ca4ec`](https://github.com/apache/spark/commit/03ca4ec3136253d7087bcaa67bdc1401a7de90fc).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21016: [SPARK-23947][SQL] Add hashUTF8String convenience...

2018-04-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21016


---




[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData

2018-04-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20984
  
cc @hvanhovell 


---




[GitHub] spark pull request #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for Arr...

2018-04-09 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20984#discussion_r180292435
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayData.scala 
---
@@ -164,3 +167,32 @@ abstract class ArrayData extends SpecializedGetters 
with Serializable {
 }
   }
 }
+
+class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) 
extends IndexedSeq[T] {
+
+  private lazy val accessor: (Int) => Any = dataType match {
+case BooleanType => (idx: Int) => arrayData.getBoolean(idx)
+case ByteType => (idx: Int) => arrayData.getByte(idx)
+case ShortType => (idx: Int) => arrayData.getShort(idx)
+case IntegerType => (idx: Int) => arrayData.getInt(idx)
+case LongType => (idx: Int) => arrayData.getLong(idx)
+case FloatType => (idx: Int) => arrayData.getFloat(idx)
+case DoubleType => (idx: Int) => arrayData.getDouble(idx)
+case d: DecimalType => (idx: Int) => arrayData.getDecimal(idx, 
d.precision, d.scale)
+case CalendarIntervalType => (idx: Int) => arrayData.getInterval(idx)
+case StringType => (idx: Int) => arrayData.getUTF8String(idx)
+case BinaryType => (idx: Int) => arrayData.getBinary(idx)
+case s: StructType => (idx: Int) => arrayData.getStruct(idx, s.length)
+case _: ArrayType => (idx: Int) => arrayData.getArray(idx)
+case _: MapType => (idx: Int) => arrayData.getMap(idx)
+case _ => (idx: Int) => arrayData.get(idx, dataType)
+  }
+
+  override def apply(idx: Int): T = if (idx < arrayData.numElements()) {
--- End diff --

Thanks. Added the check now.
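
For reference, a hedged sketch of the kind of check being described (the exact message wording is an assumption):

```scala
override def apply(idx: Int): T =
  if (idx >= 0 && idx < arrayData.numElements()) {
    // Delegate to the type-specialized accessor resolved above.
    accessor(idx).asInstanceOf[T]
  } else {
    throw new IndexOutOfBoundsException(
      s"Index $idx must be in [0, ${arrayData.numElements()})")
  }
```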


---




[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...

2018-04-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19813
  
@cloud-fan Since #20043 has been merged, I will polish this and implement the above idea, and will submit a PR when it is ready.


---




[GitHub] spark issue #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of managed t...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21001
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of managed t...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21001
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89080/
Test PASSed.


---




[GitHub] spark issue #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of managed t...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21001
  
**[Test build #89080 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89080/testReport)** for PR 21001 at commit [`3f3391b`](https://github.com/apache/spark/commit/3f3391b28a7d1c8fed1df195d6b1a0e27f0a60c4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21019
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2120/
Test PASSed.


---




[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21019
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21019
  
**[Test build #89085 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89085/testReport)** for PR 21019 at commit [`685124a`](https://github.com/apache/spark/commit/685124a11b789af2a42b4978e25ed404b2a15176).


---




[GitHub] spark pull request #21019: [SPARK-23948] Trigger mapstage's job listener in ...

2018-04-09 Thread jinxing64
GitHub user jinxing64 opened a pull request:

https://github.com/apache/spark/pull/21019

[SPARK-23948] Trigger mapstage's job listener in submitMissingTasks

## What changes were proposed in this pull request?

SparkContext submits a map stage via `submitMapStage` to the `DAGScheduler`, and `markMapStageJobAsFinished` is called only in ();

But consider the scenario below:
1. stage0 and stage1 are both `ShuffleMapStage`s, and stage1 depends on stage0;
2. we submit stage1 via `submitMapStage`, and stage1 has 10 missing tasks;
3. while stage1 is running, a `FetchFailed` occurs, and stage0 and stage1 get resubmitted as stage0_1 and stage1_1;
4. while stage0_1 is running, speculated tasks from the old stage1 come back as succeeded, but stage1 is no longer in `runningStages`. So even though all splits (including the speculated tasks) of stage1 succeeded, stage1's job listener is never called;
5. stage0_1 finishes and stage1_1 starts running. In `submitMissingTasks` there are no missing tasks, but in the current code the job listener is still not triggered (see the sketch below).
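
A hedged sketch of the idea, assuming it runs inside `DAGScheduler.submitMissingTasks`; the member names mirror existing `DAGScheduler` internals, but the exact shape is an illustration, not the patch itself:

```scala
// Hedged sketch: when a ShuffleMapStage turns out to have no missing
// partitions (all outputs are already available), finish the stage and
// notify any waiting map-stage job listeners instead of silently returning.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
if (partitionsToCompute.isEmpty) {
  markStageAsFinished(stage, None)
  stage match {
    case sms: ShuffleMapStage =>
      // Without this, a submitMapStage job whose stage needs no new tasks
      // would never see its listener fire (the scenario above).
      sms.mapStageJobs.foreach { job =>
        markMapStageJobAsFinished(job, mapOutputTracker.getStatistics(sms.shuffleDep))
      }
    case _ =>
  }
  submitWaitingChildStages(stage)
  return
}
```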

## How was this patch tested?

Not added yet.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jinxing64/spark SPARK-23948

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21019.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21019


commit 685124a11b789af2a42b4978e25ed404b2a15176
Author: jinxing 
Date:   2018-04-10T03:33:02Z

[SPARK-23948] Trigger mapstage's job listener in submitMissingTasks




---




[GitHub] spark issue #20757: [SPARK-23595][SQL] ValidateExternalType should support i...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20757
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2119/
Test PASSed.


---




[GitHub] spark issue #20757: [SPARK-23595][SQL] ValidateExternalType should support i...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20757
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20757: [SPARK-23595][SQL] ValidateExternalType should support i...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20757
  
**[Test build #89084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89084/testReport)** for PR 20757 at commit [`a787f9a`](https://github.com/apache/spark/commit/a787f9a782acc132bc38d5f29b0abcb3aba9a739).


---




[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...

2018-04-09 Thread edwinalu
Github user edwinalu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20940#discussion_r180287725
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -234,8 +244,22 @@ private[spark] class EventLoggingListener(
 }
   }
 
-  // No-op because logging every update would be overkill
-  override def onExecutorMetricsUpdate(event: 
SparkListenerExecutorMetricsUpdate): Unit = { }
+  /**
+   * Log if there is a new peak value for one of the memory metrics for 
the given executor.
+   * Metrics are cleared out when a new stage is started in 
onStageSubmitted, so this will
+   * log new peak memory metric values per executor per stage.
+   */
+  override def onExecutorMetricsUpdate(event: 
SparkListenerExecutorMetricsUpdate): Unit = {
--- End diff --

The original analysis was focused on some test applications and larger 
applications. Looking at a broader sample of our applications, the logging 
overhead is higher, averaging about 14% per application, and 4% overall (sum of 
logging for ExecutorMetricsUpdate / sum of Spark history logs). The overhead 
for larger Spark history logs is mostly pretty small, but increases 
significantly for smaller ones (there's one at 49%).

There's often some logging before the first stage starts; this is extra overhead, especially for smaller applications/history logs, and it doesn't contain useful information. The overhead can also be high when a stage takes a long time to run and memory increases gradually rather than reaching its peak quickly -- logging at stage end would work better for this case.

I should also note that these numbers are for the case where only 4 longs 
are recorded, and with more metrics, the overhead would be higher, both in the 
size of each logged event, and the number of potential peaks, since a new peak 
for any metric would be logged.

Since more metrics will be added, and the cost is higher than originally estimated, logging at stage end could be a better choice. We would lose some information about overlapping stages, but that information wouldn't be visible anyway with the currently planned REST APIs or web UI, which just show the peaks per stage and per executor.

For logging at stage end, we can log an ExecutorMetricsUpdate event for 
each executor that has sent a heartbeat for the stage just before logging the 
stage end -- this would have the peak value for each metric. This should be the 
minimum amount of logging needed to have information about peak values per 
stage per executor. Alternatively, the information could be added to the 
StageCompleted event for more compaction, but the code would be more 
complicated, with 2 paths for reading in values. Logging an event per executor 
at stage end seems like a reasonable choice, not too much extra logging or too 
much extra complexity.

What are your thoughts?


---




[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180282891
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -0,0 +1,381 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, 
KafkaConsumer}
+import org.apache.kafka.common.{KafkaException, TopicPartition}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
+
+private[kafka010] sealed trait KafkaDataConsumer[K, V] {
+  /**
+   * Get the record for the given offset if available.
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.get(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Start a batch on a compacted topic
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedStart(offset: Long, pollTimeoutMs: Long): Unit = {
+internalConsumer.compactedStart(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   *
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedNext(pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.compactedNext(pollTimeoutMs)
+  }
+
+  /**
+   * Rewind to previous record in the batch from a compacted topic.
+   *
+   * @throws NoSuchElementException if no previous element
+   */
+  def compactedPrevious(): ConsumerRecord[K, V] = {
+internalConsumer.compactedPrevious()
+  }
+
+  /**
+   * Release this consumer from being further used. Depending on its 
implementation,
+   * this consumer will be either finalized, or reset for reuse later.
+   */
+  def release(): Unit
+
+  /** Reference to the internal implementation that this wrapper delegates 
to */
+  protected def internalConsumer: InternalKafkaConsumer[K, V]
+}
+
+
+/**
+ * A wrapper around Kafka's KafkaConsumer.
+ * This is not for direct use outside this file.
+ */
+private[kafka010]
+class InternalKafkaConsumer[K, V](
+  val groupId: String,
+  val topicPartition: TopicPartition,
+  val kafkaParams: ju.Map[String, Object]) extends Logging {
+
+  require(groupId == kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG),
--- End diff --

This is mostly vestigial (there used to be a remove method that took a 
groupId, but no kafkaParams, so there was symmetry).  

I don't see a reason it can't be changed to match the SQL version at this 
point, i.e. assign groupId from the params.
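
A minimal sketch of that change (hedged; it derives the id from the params, mirroring the SQL Kafka source's approach):

```scala
import java.{util => ju}

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.TopicPartition

import org.apache.spark.internal.Logging

private[kafka010] class InternalKafkaConsumer[K, V](
    val topicPartition: TopicPartition,
    val kafkaParams: ju.Map[String, Object]) extends Logging {

  // Derived rather than passed in, so the cache key can never disagree
  // with the params:
  val groupId: String =
    kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG).asInstanceOf[String]
}
```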


---




[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180283864
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -0,0 +1,381 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, 
KafkaConsumer}
+import org.apache.kafka.common.{KafkaException, TopicPartition}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
+
+private[kafka010] sealed trait KafkaDataConsumer[K, V] {
+  /**
+   * Get the record for the given offset if available.
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.get(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Start a batch on a compacted topic
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedStart(offset: Long, pollTimeoutMs: Long): Unit = {
+internalConsumer.compactedStart(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   *
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedNext(pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.compactedNext(pollTimeoutMs)
+  }
+
+  /**
+   * Rewind to previous record in the batch from a compacted topic.
+   *
+   * @throws NoSuchElementException if no previous element
+   */
+  def compactedPrevious(): ConsumerRecord[K, V] = {
+internalConsumer.compactedPrevious()
+  }
+
+  /**
+   * Release this consumer from being further used. Depending on its 
implementation,
+   * this consumer will be either finalized, or reset for reuse later.
+   */
+  def release(): Unit
+
+  /** Reference to the internal implementation that this wrapper delegates 
to */
+  protected def internalConsumer: InternalKafkaConsumer[K, V]
+}
+
+
+/**
+ * A wrapper around Kafka's KafkaConsumer.
+ * This is not for direct use outside this file.
+ */
+private[kafka010]
+class InternalKafkaConsumer[K, V](
+  val groupId: String,
+  val topicPartition: TopicPartition,
+  val kafkaParams: ju.Map[String, Object]) extends Logging {
+
+  require(groupId == kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG),
+"groupId used for cache key must match the groupId in kafkaParams")
+
+  @volatile private var consumer = createConsumer
+
+  /** indicates whether this consumer is in use or not */
+  @volatile var inUse = true
+
+  /** indicate whether this consumer is going to be stopped in the next 
release */
+  @volatile var markedForClose = false
+
+  // TODO if the buffer was kept around as a random-access structure,
+  // could possibly optimize re-calculating of an RDD in the same batch
+  @volatile private var buffer = 
ju.Collections.emptyListIterator[ConsumerRecord[K, V]]()
+  @volatile private var nextOffset = InternalKafkaConsumer.UNKNOWN_OFFSET
+
+  override def toString: String = {
+"InternalKafkaConsumer(" +
+  s"hash=${Integer.toHexString(hashCode)}, " +
+  s"groupId=$groupId, " +
+  s"topicPartition=$topicPartition)"
+  }
+
+  /** Create a KafkaConsumer to fetch records for `topicPartition` */
+  private def createConsumer: KafkaConsumer[K, V] = {
+val c = new KafkaConsumer[K, V](kafkaParams)
+val tps = new 

[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180284812
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -0,0 +1,381 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, 
KafkaConsumer}
+import org.apache.kafka.common.{KafkaException, TopicPartition}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
+
+private[kafka010] sealed trait KafkaDataConsumer[K, V] {
+  /**
+   * Get the record for the given offset if available.
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.get(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Start a batch on a compacted topic
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedStart(offset: Long, pollTimeoutMs: Long): Unit = {
+internalConsumer.compactedStart(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   *
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedNext(pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.compactedNext(pollTimeoutMs)
+  }
+
+  /**
+   * Rewind to previous record in the batch from a compacted topic.
+   *
+   * @throws NoSuchElementException if no previous element
+   */
+  def compactedPrevious(): ConsumerRecord[K, V] = {
+internalConsumer.compactedPrevious()
+  }
+
+  /**
+   * Release this consumer from being further used. Depending on its 
implementation,
+   * this consumer will be either finalized, or reset for reuse later.
+   */
+  def release(): Unit
+
+  /** Reference to the internal implementation that this wrapper delegates 
to */
+  protected def internalConsumer: InternalKafkaConsumer[K, V]
+}
+
+
+/**
+ * A wrapper around Kafka's KafkaConsumer.
+ * This is not for direct use outside this file.
+ */
+private[kafka010]
+class InternalKafkaConsumer[K, V](
+  val groupId: String,
+  val topicPartition: TopicPartition,
+  val kafkaParams: ju.Map[String, Object]) extends Logging {
+
+  require(groupId == kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG),
+"groupId used for cache key must match the groupId in kafkaParams")
+
+  @volatile private var consumer = createConsumer
+
+  /** indicates whether this consumer is in use or not */
+  @volatile var inUse = true
+
+  /** indicate whether this consumer is going to be stopped in the next 
release */
+  @volatile var markedForClose = false
+
+  // TODO if the buffer was kept around as a random-access structure,
+  // could possibly optimize re-calculating of an RDD in the same batch
+  @volatile private var buffer = 
ju.Collections.emptyListIterator[ConsumerRecord[K, V]]()
+  @volatile private var nextOffset = InternalKafkaConsumer.UNKNOWN_OFFSET
+
+  override def toString: String = {
+"InternalKafkaConsumer(" +
+  s"hash=${Integer.toHexString(hashCode)}, " +
+  s"groupId=$groupId, " +
+  s"topicPartition=$topicPartition)"
+  }
+
+  /** Create a KafkaConsumer to fetch records for `topicPartition` */
+  private def createConsumer: KafkaConsumer[K, V] = {
+val c = new KafkaConsumer[K, V](kafkaParams)
+val tps = new 

[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180280531
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -0,0 +1,381 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, 
KafkaConsumer}
+import org.apache.kafka.common.{KafkaException, TopicPartition}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
+
+private[kafka010] sealed trait KafkaDataConsumer[K, V] {
+  /**
+   * Get the record for the given offset if available.
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.get(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Start a batch on a compacted topic
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedStart(offset: Long, pollTimeoutMs: Long): Unit = {
+internalConsumer.compactedStart(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   *
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedNext(pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.compactedNext(pollTimeoutMs)
+  }
+
+  /**
+   * Rewind to previous record in the batch from a compacted topic.
+   *
+   * @throws NoSuchElementException if no previous element
+   */
+  def compactedPrevious(): ConsumerRecord[K, V] = {
+internalConsumer.compactedPrevious()
+  }
+
+  /**
+   * Release this consumer from being further used. Depending on its 
implementation,
+   * this consumer will be either finalized, or reset for reuse later.
+   */
+  def release(): Unit
+
+  /** Reference to the internal implementation that this wrapper delegates 
to */
+  protected def internalConsumer: InternalKafkaConsumer[K, V]
+}
+
+
+/**
+ * A wrapper around Kafka's KafkaConsumer.
+ * This is not for direct use outside this file.
+ */
+private[kafka010]
+class InternalKafkaConsumer[K, V](
+  val groupId: String,
+  val topicPartition: TopicPartition,
+  val kafkaParams: ju.Map[String, Object]) extends Logging {
+
+  require(groupId == kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG),
+"groupId used for cache key must match the groupId in kafkaParams")
+
+  @volatile private var consumer = createConsumer
--- End diff --

This is an example of the cut & paste I was referring to.

In this case, I don't believe consumer is ever reassigned, so it doesn't 
even need to be a var.

It was reassigned in the SQL version of the code.
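
For illustration, the simplification would be (a sketch):

```scala
// `consumer` is never reassigned in this version, so a plain val suffices
// and @volatile is unnecessary:
private val consumer: KafkaConsumer[K, V] = createConsumer
```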


---




[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180285599
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -0,0 +1,381 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, 
KafkaConsumer}
+import org.apache.kafka.common.{KafkaException, TopicPartition}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
+
+private[kafka010] sealed trait KafkaDataConsumer[K, V] {
+  /**
+   * Get the record for the given offset if available.
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.get(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Start a batch on a compacted topic
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedStart(offset: Long, pollTimeoutMs: Long): Unit = {
+internalConsumer.compactedStart(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   *
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedNext(pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.compactedNext(pollTimeoutMs)
+  }
+
+  /**
+   * Rewind to previous record in the batch from a compacted topic.
+   *
+   * @throws NoSuchElementException if no previous element
+   */
+  def compactedPrevious(): ConsumerRecord[K, V] = {
+internalConsumer.compactedPrevious()
+  }
+
+  /**
+   * Release this consumer from being further used. Depending on its implementation,
+   * this consumer will be either finalized, or reset for reuse later.
+   */
+  def release(): Unit
+
+  /** Reference to the internal implementation that this wrapper delegates to */
+  protected def internalConsumer: InternalKafkaConsumer[K, V]
+}
+
+
+/**
+ * A wrapper around Kafka's KafkaConsumer.
+ * This is not for direct use outside this file.
+ */
+private[kafka010]
+class InternalKafkaConsumer[K, V](
+  val groupId: String,
+  val topicPartition: TopicPartition,
+  val kafkaParams: ju.Map[String, Object]) extends Logging {
+
+  require(groupId == kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG),
+"groupId used for cache key must match the groupId in kafkaParams")
+
+  @volatile private var consumer = createConsumer
+
+  /** indicates whether this consumer is in use or not */
+  @volatile var inUse = true
+
+  /** indicate whether this consumer is going to be stopped in the next release */
+  @volatile var markedForClose = false
+
+  // TODO if the buffer was kept around as a random-access structure,
+  // could possibly optimize re-calculating of an RDD in the same batch
+  @volatile private var buffer = ju.Collections.emptyListIterator[ConsumerRecord[K, V]]()
+  @volatile private var nextOffset = InternalKafkaConsumer.UNKNOWN_OFFSET
+
+  override def toString: String = {
+"InternalKafkaConsumer(" +
+  s"hash=${Integer.toHexString(hashCode)}, " +
+  s"groupId=$groupId, " +
+  s"topicPartition=$topicPartition)"
+  }
+
+  /** Create a KafkaConsumer to fetch records for `topicPartition` */
+  private def createConsumer: KafkaConsumer[K, V] = {
+val c = new KafkaConsumer[K, V](kafkaParams)
+val tps = new 

[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180280210
  
--- Diff: 
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumerSuite.scala
 ---
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.util.concurrent.{Executors, TimeUnit}
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+import org.apache.kafka.clients.consumer.ConsumerConfig._
+import org.apache.kafka.common.TopicPartition
+import org.apache.kafka.common.serialization.ByteArrayDeserializer
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark._
+
+class KafkaDataConsumerSuite extends SparkFunSuite with BeforeAndAfterAll {
+
+  private var testUtils: KafkaTestUtils = _
+
+  override def beforeAll {
+super.beforeAll()
+testUtils = new KafkaTestUtils
+testUtils.setup()
+  }
+
+  override def afterAll {
+if (testUtils != null) {
+  testUtils.teardown()
+  testUtils = null
+}
+super.afterAll()
+  }
+
+  test("concurrent use of KafkaDataConsumer") {
+KafkaDataConsumer.init(16, 64, 0.75f)
+
+val topic = "topic" + Random.nextInt()
+val data = (1 to 1000).map(_.toString)
+val topicPartition = new TopicPartition(topic, 0)
+testUtils.createTopic(topic)
+testUtils.sendMessages(topic, data.toArray)
+
+val groupId = "groupId"
+val kafkaParams = Map[String, Object](
+  GROUP_ID_CONFIG -> groupId,
+  BOOTSTRAP_SERVERS_CONFIG -> testUtils.brokerAddress,
+  KEY_DESERIALIZER_CLASS_CONFIG -> classOf[ByteArrayDeserializer].getName,
+  VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[ByteArrayDeserializer].getName,
+  AUTO_OFFSET_RESET_CONFIG -> "earliest",
+  ENABLE_AUTO_COMMIT_CONFIG -> "false"
+)
+
+val numThreads = 100
+val numConsumerUsages = 500
+
+@volatile var error: Throwable = null
+
+def consume(i: Int): Unit = {
+  val useCache = Random.nextBoolean
+  val taskContext = if (Random.nextBoolean) {
+new TaskContextImpl(0, 0, 0, 0, attemptNumber = Random.nextInt(2), null, null, null)
+  } else {
+null
+  }
+  val consumer = KafkaDataConsumer.acquire[Array[Byte], Array[Byte]](
+groupId, topicPartition, kafkaParams.asJava, taskContext, useCache)
+  try {
+val rcvd = 0 until data.length map { offset =>
+  val bytes = consumer.get(offset, 1).value()
+  new String(bytes)
+}
+assert(rcvd == data)
+  } catch {
+case e: Throwable =>
+  error = e
+  throw e
+  } finally {
+consumer.release()
+  }
+}
+
+val threadPool = Executors.newFixedThreadPool(numThreads)
+try {
+  val futures = (1 to numConsumerUsages).map { i =>
+threadPool.submit(new Runnable {
+  override def run(): Unit = { consume(i) }
+})
+  }
+  futures.foreach(_.get(1, TimeUnit.MINUTES))
+  assert(error == null)
+} finally {
+  threadPool.shutdown()
+}
+  }
+}
--- End diff --

If this PR is intended to fix a problem with silent reading of incorrect 
data, can you add a test reproducing that?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of ...

2018-04-09 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/20997#discussion_r180283733
  
--- Diff: 
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaDataConsumer.scala
 ---
@@ -0,0 +1,381 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka010
+
+import java.{util => ju}
+
+import scala.collection.JavaConverters._
+
+import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord, KafkaConsumer}
+import org.apache.kafka.common.{KafkaException, TopicPartition}
+
+import org.apache.spark.TaskContext
+import org.apache.spark.internal.Logging
+
+private[kafka010] sealed trait KafkaDataConsumer[K, V] {
+  /**
+   * Get the record for the given offset if available.
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def get(offset: Long, pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.get(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Start a batch on a compacted topic
+   *
+   * @param offset the offset to fetch.
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedStart(offset: Long, pollTimeoutMs: Long): Unit = {
+internalConsumer.compactedStart(offset, pollTimeoutMs)
+  }
+
+  /**
+   * Get the next record in the batch from a compacted topic.
+   * Assumes compactedStart has been called first, and ignores gaps.
+   *
+   * @param pollTimeoutMs  timeout in milliseconds to poll data from Kafka.
+   */
+  def compactedNext(pollTimeoutMs: Long): ConsumerRecord[K, V] = {
+internalConsumer.compactedNext(pollTimeoutMs)
+  }
+
+  /**
+   * Rewind to previous record in the batch from a compacted topic.
+   *
+   * @throws NoSuchElementException if no previous element
+   */
+  def compactedPrevious(): ConsumerRecord[K, V] = {
+internalConsumer.compactedPrevious()
+  }
+
+  /**
+   * Release this consumer from being further used. Depending on its implementation,
+   * this consumer will be either finalized, or reset for reuse later.
+   */
+  def release(): Unit
+
+  /** Reference to the internal implementation that this wrapper delegates to */
+  protected def internalConsumer: InternalKafkaConsumer[K, V]
+}
+
+
+/**
+ * A wrapper around Kafka's KafkaConsumer.
+ * This is not for direct use outside this file.
+ */
+private[kafka010]
+class InternalKafkaConsumer[K, V](
+  val groupId: String,
+  val topicPartition: TopicPartition,
+  val kafkaParams: ju.Map[String, Object]) extends Logging {
+
+  require(groupId == kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG),
+"groupId used for cache key must match the groupId in kafkaParams")
+
+  @volatile private var consumer = createConsumer
+
+  /** indicates whether this consumer is in use or not */
+  @volatile var inUse = true
+
+  /** indicate whether this consumer is going to be stopped in the next release */
+  @volatile var markedForClose = false
+
+  // TODO if the buffer was kept around as a random-access structure,
+  // could possibly optimize re-calculating of an RDD in the same batch
+  @volatile private var buffer = ju.Collections.emptyListIterator[ConsumerRecord[K, V]]()
+  @volatile private var nextOffset = InternalKafkaConsumer.UNKNOWN_OFFSET
+
+  override def toString: String = {
+"InternalKafkaConsumer(" +
+  s"hash=${Integer.toHexString(hashCode)}, " +
+  s"groupId=$groupId, " +
+  s"topicPartition=$topicPartition)"
+  }
+
+  /** Create a KafkaConsumer to fetch records for `topicPartition` */
+  private def createConsumer: KafkaConsumer[K, V] = {
+val c = new KafkaConsumer[K, V](kafkaParams)
+val tps = new 

[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21018
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2118/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21018
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21018: [SPARK-23880][SQL] Do not trigger any jobs for caching d...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21018
  
**[Test build #89083 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89083/testReport)**
 for PR 21018 at commit 
[`01d75d7`](https://github.com/apache/spark/commit/01d75d789c45f73bd999106dfc6f29cdc3050ce9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21018: [SPARK-23880][SQL] Do not trigger any jobs for ca...

2018-04-09 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21018

[SPARK-23880][SQL] Do not trigger any jobs for caching data

## What changes were proposed in this pull request?
This PR fixes the code so that `cache` does not trigger any jobs.
For example, in the current master, the operation below triggers an actual job:
```
val df = spark.range(100L)
  .filter('id > 1000)
  .orderBy('id.desc)
  .cache()
```
This triggers a job even though the cache should be lazy. The problem is that, 
when creating `InMemoryRelation`, we build the RDD, which calls 
`SparkPlan.execute` and may trigger jobs, such as a sampling job for a range 
partitioner or a broadcast job.

With this fix, the `RDD` is no longer built in the constructor of 
`InMemoryRelation`. Instead, `InMemoryTableScanExec` materializes the cache and 
updates the entry in `CacheManager`.
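
After the fix, the cached plan is only materialized by the first action; a 
usage sketch (continuing the example above, illustrative only):

```scala
df.count()  // first action: runs the job(s) and populates the cache
df.count()  // subsequent actions are served from the in-memory cache
```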

## How was this patch tested?
Added tests in `CachedTableSuite`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark SPARK-23880

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21018


commit 01d75d789c45f73bd999106dfc6f29cdc3050ce9
Author: Takeshi Yamamuro 
Date:   2018-04-09T09:30:10Z

Fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21016: [SPARK-23947][SQL] Add hashUTF8String convenience method...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21016
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21016: [SPARK-23947][SQL] Add hashUTF8String convenience method...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21016
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89078/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21016: [SPARK-23947][SQL] Add hashUTF8String convenience method...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21016
  
**[Test build #89078 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89078/testReport)**
 for PR 21016 at commit 
[`ead91a4`](https://github.com/apache/spark/commit/ead91a453fb6a3f92255d14d37c1d4a6beac22fd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21017
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21017
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2117/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21017: [SPARK-23748][SS] Fix SS continuous process doesn't supp...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21017
  
**[Test build #89082 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89082/testReport)**
 for PR 21017 at commit 
[`03ca4ec`](https://github.com/apache/spark/commit/03ca4ec3136253d7087bcaa67bdc1401a7de90fc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21017: [SPARK-23748][SS] Fix SS continuous process doesn...

2018-04-09 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/21017

[SPARK-23748][SS] Fix SS continuous process doesn't support SubqueryAlias 
issue

## What changes were proposed in this pull request?

Currently, SS continuous processing doesn't support querying a temp table or 
`df.as("xxx")`; SS throws an exception saying the LogicalPlan is not supported, 
with the details described 
[here](https://issues.apache.org/jira/browse/SPARK-23748).

So this PR proposes to add that support.
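
For illustration, a sketch of the kind of continuous query that previously 
failed (the source, sink, and trigger below are illustrative, not taken from 
the patch):

```scala
import org.apache.spark.sql.streaming.Trigger

val df = spark.readStream.format("rate").load()
  .as("aliased")  // introduces a SubqueryAlias node in the logical plan

df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```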

## How was this patch tested?

new UT.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-23748

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21017.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21017


commit 03ca4ec3136253d7087bcaa67bdc1401a7de90fc
Author: jerryshao 
Date:   2018-04-10T01:47:05Z

Fix SS continuous process doesn't support SubqueryAlias issue




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached ...

2018-04-09 Thread koeninger
Github user koeninger commented on the issue:

https://github.com/apache/spark/pull/20997
  
In general, 2 things about this make me uncomfortable:

- It's basically a cut-and-paste of the SQL equivalent PR, 
https://github.com/apache/spark/pull/20767, but it is different from both that 
PR and the existing DStream code.

- I don't see an upper bound on the number of consumers per key, nor a way of 
reaping idle consumers. If the SQL equivalent code is likely to be modified to 
use pooling of some kind, it seems better to make a consistent decision (one 
way to bound the cache is sketched below).
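
A rough sketch of one bounding option (hypothetical names; similar in spirit 
to the size-capped map the existing DStream cache uses): a `LinkedHashMap` 
that evicts the eldest entry once capacity is exceeded.

```scala
import java.{util => ju}

// Illustrative only: eviction should also close the evicted Kafka consumer.
class BoundedConsumerCache[K, V](maxCapacity: Int)
    extends ju.LinkedHashMap[K, V](16, 0.75f, true) {
  override def removeEldestEntry(eldest: ju.Map.Entry[K, V]): Boolean =
    size() > maxCapacity
}
```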




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21009
  
**[Test build #89081 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89081/testReport)**
 for PR 21009 at commit 
[`79beb00`](https://github.com/apache/spark/commit/79beb00026fb2dd2df295bccd797aaf21746ebdb).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday

2018-04-09 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21009
  
Jenkins, test this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday

2018-04-09 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/21009
  
Jenkins, add to whitelist.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20952: [SPARK-6951][core] Speed up parsing of event logs during...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20952
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89076/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20952: [SPARK-6951][core] Speed up parsing of event logs during...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20952
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20952: [SPARK-6951][core] Speed up parsing of event logs during...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20952
  
**[Test build #89076 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89076/testReport)**
 for PR 20952 at commit 
[`3a5563b`](https://github.com/apache/spark/commit/3a5563bec7f5f4063b1d32c0af653b13b54186c2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...

2018-04-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21007
  
cc @cloud-fan and @viirya (from checking the history).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20989: [SPARK-23529][K8s] Support mounting hostPath volumes for...

2018-04-09 Thread madanadit
Github user madanadit commented on the issue:

https://github.com/apache/spark/pull/20989
  
@foxish The failure seems to be because `SparkRemoteFileTest` is missing from 
branch-2.3 (it is only present in the master branch). Which branch do you 
recommend targeting this PR at?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of managed t...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21001
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2116/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of managed t...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21001
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21001: [SPARK-19724][SQL][FOLLOW-UP]Check location of managed t...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21001
  
**[Test build #89080 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89080/testReport)**
 for PR 21001 at commit 
[`3f3391b`](https://github.com/apache/spark/commit/3f3391b28a7d1c8fed1df195d6b1a0e27f0a60c4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21015: [SPARK-23944][ML] Add the set method for the two LSHMode...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21015
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89079/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21015: [SPARK-23944][ML] Add the set method for the two LSHMode...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21015
  
**[Test build #89079 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89079/testReport)**
 for PR 21015 at commit 
[`9f16ea6`](https://github.com/apache/spark/commit/9f16ea6f15572a8d189fe537844487abdea797b4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21015: [SPARK-23944][ML] Add the set method for the two LSHMode...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21015
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21005
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89073/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21005
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21005: [SPARK-23898][SQL] Simplify add & subtract code generati...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21005
  
**[Test build #89073 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89073/testReport)**
 for PR 21005 at commit 
[`433`](https://github.com/apache/spark/commit/43314b1d443fac5ca27ecef80677dbe70ab7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20925: [SPARK-22941][core] Do not exit JVM when submit fails wi...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20925
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20925: [SPARK-22941][core] Do not exit JVM when submit fails wi...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20925
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89074/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20925: [SPARK-22941][core] Do not exit JVM when submit fails wi...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20925
  
**[Test build #89074 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89074/testReport)**
 for PR 20925 at commit 
[`262bad8`](https://github.com/apache/spark/commit/262bad88a6d4d6c2513d6da3b2b52e86cd3f5b70).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21015: [SPARK-23944][ML] Add the set method for the two LSHMode...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21015
  
**[Test build #89079 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89079/testReport)**
 for PR 21015 at commit 
[`9f16ea6`](https://github.com/apache/spark/commit/9f16ea6f15572a8d189fe537844487abdea797b4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21015: [SPARK-23944][ML] Add the set method for the two LSHMode...

2018-04-09 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/21015
  
add to whitelist


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20992: [SPARK-23779][SQL] TaskMemoryManager and UnsafeSorter re...

2018-04-09 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/20992
  
@kiszk I did not look at the PR in detail; I was curious about the commit 
saying 'address review comment' but did not see any review comments. Were they 
removed by the submitter? Or is GitHub acting up?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

2018-04-09 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/20998
  
Adding isRunning means that a single 'bad' node (bad from the task's point of 
view; not necessarily bad hardware, just a node on which the task fails) can 
cause tasks to fail repeatedly, eventually causing the app to exit.

Particularly with blacklisting, I am not very sure how the interactions will 
play out; @squito might have more comments. In general, this is not a benign 
change IMO and can have non-trivial side effects.

In the specific use case of only two machines, it is an unfortunate side 
effect.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21016: [SPARK-23947][SQL] Add hashUTF8String convenience method...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21016
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21016: [SPARK-23947][SQL] Add hashUTF8String convenience method...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21016
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2115/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20043: [SPARK-22856][SQL] Add wrappers for codegen output and n...

2018-04-09 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20043
  
Thanks! @cloud-fan @hvanhovell @kiszk @HyukjinKwon @mgaido91 @maropu 
@rednaxelafx @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21016: [SPARK-23947][SQL] Add hashUTF8String convenience method...

2018-04-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21016
  
**[Test build #89078 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89078/testReport)**
 for PR 21016 at commit 
[`ead91a4`](https://github.com/apache/spark/commit/ead91a453fb6a3f92255d14d37c1d4a6beac22fd).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21016: [SPARK-23947][SQL] Add hashUTF8String convenience...

2018-04-09 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/21016

[SPARK-23947][SQL] Add hashUTF8String convenience method to hasher classes

## What changes were proposed in this pull request?

Add `hashUTF8String()` to the hasher classes to allow Spark SQL codegen to 
generate cleaner code for hashing `UTF8String`s. No change in behavior 
otherwise.

Although with the introduction of SPARK-10399, the code size for hashing 
`UTF8String` is already smaller, it's still good to extract a separate function 
in the hasher classes so that the generated code can stay clean.
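
A rough sketch of such a convenience method (illustrative, written in 
Scala-style pseudocode; it assumes the hasher exposes a 
`hashUnsafeBytes(base, offset, length, seed)` entry point, as e.g. 
`Murmur3_x86_32` does):

```scala
// Forward to the existing bytes hasher so generated code can emit one call
// per UTF8String instead of spelling out base/offset/length at each call site.
def hashUTF8String(s: UTF8String, seed: Int): Int =
  hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes, seed)
```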

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rednaxelafx/apache-spark hashutf8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21016


commit ead91a453fb6a3f92255d14d37c1d4a6beac22fd
Author: Kris Mok 
Date:   2018-04-09T22:30:03Z

SPARK-23947 - Add hashUTF8String convenience method to hasher classes




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21015: [SPARK-23944][ML] Add the set method for the two LSHMode...

2018-04-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21015
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


