[GitHub] spark issue #21221: [SPARK-23429][CORE] Add executor memory metrics to heart...

2018-08-15 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21221
  
Jenkins, retest this please


---




[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r210492311
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -216,8 +217,7 @@ private[spark] class Executor(
 
   def stop(): Unit = {
 env.metricsSystem.report()
-heartbeater.shutdown()
-heartbeater.awaitTermination(10, TimeUnit.SECONDS)
+heartbeater.stop()
--- End diff --

future: `try {} catch {  case NonFatal(e)`?
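
For context, a minimal sketch of the suggested guard (assuming the `stop()` method from the diff above and Spark's `Logging.logWarning`; an illustration, not the merged code):

```scala
import scala.util.control.NonFatal

def stop(): Unit = {
  env.metricsSystem.report()
  // Swallow non-fatal errors so a failing heartbeater cannot abort shutdown.
  try {
    heartbeater.stop()
  } catch {
    case NonFatal(e) => logWarning("Unable to stop heartbeater", e)
  }
}
```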


---




[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21221#discussion_r210492513
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -69,6 +69,11 @@ package object config {
 .bytesConf(ByteUnit.KiB)
 .createWithDefaultString("100k")
 
+  private[spark] val EVENT_LOG_STAGE_EXECUTOR_METRICS =
+ConfigBuilder("spark.eventLog.logStageExecutorMetrics.enabled")
+  .booleanConf
+  .createWithDefault(true)
--- End diff --

should this be "false" for now, until we can test this out more, just to 
be on the safe side?
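
If the default were flipped to `false`, users would opt in explicitly. A sketch of that opt-in (the first key is Spark's standard event-log switch; the second is taken from the diff above):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  // Opt in to per-stage executor metrics in the event log.
  .set("spark.eventLog.logStageExecutorMetrics.enabled", "true")
```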


---




[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...

2018-08-15 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/22110#discussion_r210493260
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala 
---
@@ -73,4 +73,14 @@ object TypeUtils {
 }
 x.length - y.length
   }
+
+  /**
+   * Returns true if elements of the data type could be used as items of a 
hash set or as keys
+   * of a hash map.
+   */
+  def typeCanBeHashed(dataType: DataType): Boolean = dataType match {
--- End diff --

I will change it :)

Just one question about ```hashCode```. If ```case classes``` are used, 
```equals``` and ```hashCode``` are generated by the compiler. But if we define 
```equals``` manually, shouldn't ```a.equals(b) == true``` => 
```a.hashCode == b.hashCode``` also hold?
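
For reference, that contract does need to hold whenever `equals` is overridden by hand; hash-based collections break otherwise. A small illustrative example with a hypothetical `Point` class (not code from the PR):

```scala
class Point(val x: Int, val y: Int) {
  override def equals(other: Any): Boolean = other match {
    case p: Point => p.x == x && p.y == y
    case _        => false
  }
  // Must agree with equals: a.equals(b) implies a.hashCode == b.hashCode,
  // otherwise a HashSet/HashMap may put equal keys in different buckets.
  override def hashCode: Int = (x, y).##
}
```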


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21537
  
For 2. and 3., it is harder to state my opinion in a comment. Let me start 
with short remarks.

For 2., if I remember correctly, @viirya once wrote the API document in a 
JIRA entry. It would be good to update it and add some thoughts about the design as a 
first step. I understand that it is hard to keep a design document up to 
date, in particular during open-source development, because a lot of 
excellent comments arrive in a PR. 

For 3., at the time of the early implementation of SQL codegen (i.e. using `s""` to 
represent code), I thought there were two problems:
1. lack of a type for an expression (in other words, `ExprCode` did not carry 
the type of `value`)
2. lack of structure for statements

Now we face the problem that it is hard to rewrite a method argument due to 
problem 1. In my understanding, the current effort led by @viirya is trying to 
resolve problem 1.
It is also hard to rewrite a set of statements, due to problem 2; resolving 
it needs more effort. In my opinion, we are addressing them step by 
step.

Of course, I would be happy to co-lead a discussion of the IR 
design for 2.
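
To illustrate problem 1 in isolation, here is a sketch in which a hypothetical `TypedCode` stands in for Spark's internal `ExprCode`/`Block` machinery (illustration only):

```scala
// With plain s"" interpolation the generated Java fragment is an opaque String,
// so the type of value_0 is lost to later compiler passes.
val value = "value_0"
val plain: String = s"int $value = input.getInt(0);"

// A structured representation keeps the type alongside the source, so a
// rewriter can, e.g., change a method argument without re-parsing the code.
case class TypedCode(javaType: String, value: String, src: String)
val structured = TypedCode("int", value, plain)
```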


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22119
  
CSV is loaded by default, while AVRO is not. So having a backward-compatibility 
mapping for CSV only still makes sense. 
Let's remove the mapping for CSV in 3.0.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21537
  
If there's a bug, then let's fix it in another JIRA. If it's impossible to 
fix, or it sounds super risky and there's something I missed, let's revert.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21537
  
I wouldn't revert this unless there are specific concerns about it. Do 
you see any bug caused by the mixture of the `s""` and `code""` representations? 


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22119
  
For details, see the discussion in the JIRA 
https://issues.apache.org/jira/browse/SPARK-24924


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22119
  
If we all agree this Databricks mapping is not reasonable, I think it's OK 
to have this inconsistency and remove the mapping for CSV in 3.0.

It's weird to make the same mistake just to make things consistent (again, 
assuming we agree this is a mistake).


---




[GitHub] spark issue #22111: [SPARK-25123][SQL] Use Block to track code in SimpleExpr...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22111
  
Shall we hold these codegen PRs until we see the design doc for building an IR 
for codegen?


---




[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...

2018-08-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22119#discussion_r210491239
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -637,6 +635,12 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
+} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
--- End diff --

do we have the same check for Kafka?


---




[GitHub] spark issue #22098: [SPARK-24886][INFRA] Fix the testing script to increase ...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22098
  
@shaneknapp, it seems this was first introduced in 
https://issues.apache.org/jira/browse/SPARK-3076 / 
https://github.com/apache/spark/pull/1974 for a good reason, FWIW.

One thing I am not sure about is whether we need to have the env or not.


---




[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...

2018-08-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22119#discussion_r210491110
  
--- Diff: 
external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala ---
@@ -503,7 +495,7 @@ class AvroSuite extends QueryTest with SharedSQLContext 
with SQLTestUtils {
 // get the same values back.
 withTempPath { tempDir =>
   val name = "AvroTest"
-  val namespace = "org.apache.spark.avro"
+  val namespace = "com.databricks.spark.avro"
--- End diff --

why change the namespace in the test?


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21889
  
**[Test build #4278 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4278/testReport)**
 for PR 21889 at commit 
[`8d822ee`](https://github.com/apache/spark/commit/8d822eea805e1b2dc40b866ca8ac4893e53ad51b).


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21320
  
**[Test build #4277 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4277/testReport)**
 for PR 21320 at commit 
[`0e5594b`](https://github.com/apache/spark/commit/0e5594b6ac1dcb94f3f0166e66a7d4e7eae3d00c).


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21537
  
Thanks for involving me in an important thread. I was busy this morning in 
Japan.

I think there are three topics in this thread:
1. Merge or revert this PR
2. Design document
3. IR design

For 1., in short, my opinion leans toward reverting this PR, from the viewpoint of 
a release manager.

As we know, it is time to cut a release branch. This PR partially 
introduces a new representation. If there were a bug in code generation in 
Spark 2.4, it could complicate debugging and fixing.
As @mgaido91 pointed out, #20043 and #21026 have been merged. I think 
they are a kind of refactoring (e.g. changing the representation of literals, 
classes, and so on), so those two PRs can stay. However, this PR 
introduces a mixture of the `s""` and `code""` representations.

WDYT?


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210490145
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2848,6 +2848,35 @@ setMethod("intersect",
 dataFrame(intersected)
   })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @rdname intersectAll
+#' @note intersectAll since 2.4.0
+setMethod("intersectAll",
+  signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+  function(x, y) {
+intersected <- callJMethod(x@sdf, "intersectAll", y@sdf)
+dataFrame(intersected)
+  })
--- End diff --

@felixcheung OK.


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210490166
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2876,6 +2905,37 @@ setMethod("except",
 dataFrame(excepted)
   })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @rdname exceptAll
+#' @note exceptAll since 2.4.0
+setMethod("exceptAll",
+  signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+  function(x, y) {
+excepted <- callJMethod(x@sdf, "exceptAll", y@sdf)
+dataFrame(excepted)
+  })
+
--- End diff --

@felixcheung Sure.


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22119
  
Sorry if I missed some comments somewhere, but just for clarification: 
should we do it for CSV in 3.0.0? Inconsistency should also be taken into 
account. Actually, a configuration makes more sense to me, and then we can remove it in 
3.0.0.


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210490074
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2876,6 +2905,37 @@ setMethod("except",
 dataFrame(excepted)
   })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @rdname exceptAll
--- End diff --

@felixcheung Thanks. Did you want the original function `except` fixed as 
part of this?


---




[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21835#discussion_r210489980
  
--- Diff: R/pkg/R/functions.R ---
@@ -3320,7 +3321,7 @@ setMethod("explode",
 #' @aliases sequence sequence,Column-method
 #' @note sequence since 2.4.0
 setMethod("sequence",
- signature(x = "Column", y = "Column"),
+ signature(),
--- End diff --

sorry, I didn't see the reply.
yes, we should try to make sequence callable. 

we shouldn't have to call it manually, though; it is better to rely on R's 
internal type/call routing. it's a bit hard to explain, but check out

`attach` / `setGeneric("attach")`
or
`str` / `setGeneric("str")`

if you see what I mean.

also, we should avoid an empty `signature()` as well.


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22119
  
**[Test build #94841 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94841/testReport)**
 for PR 22119 at commit 
[`656790e`](https://github.com/apache/spark/commit/656790e7cc05ea1fd31e768ed1a05c18b7881de6).


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22119
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22119
  
@tgravescs @dongjoon-hyun @HyukjinKwon @cloud-fan 


---




[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...

2018-08-15 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22119#discussion_r210489530
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -637,6 +635,12 @@ object DataSource extends Logging {
 "Hive built-in ORC data source must be used with Hive 
support enabled. " +
 "Please use the native ORC data source by setting 
'spark.sql.orc.impl' to " +
 "'native'")
+} else if (provider1.toLowerCase(Locale.ROOT) == "avro" ||
+  provider1 == "com.databricks.spark.avro") {
+  throw new AnalysisException(
+s"Failed to find data source: 
${provider1.toLowerCase(Locale.ROOT)}. " +
+"AVRO is built-in data source since Spark 2.4. Please 
deploy the application " +
+"as per 
https://spark.apache.org/docs/latest/avro-data-source.html#deploying")
--- End diff --

I am creating documentation for the AVRO data source. Let's merge this PR 
after the README is done.


---




[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2239/
Test PASSed.


---




[GitHub] spark pull request #20725: [SPARK-23555][PYTHON] Add BinaryType support for ...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20725#discussion_r210489614
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -4331,13 +4354,22 @@ def test_createDataFrame_fallback_enabled(self):
 self.assertEqual(df.collect(), [Row(a={u'a': 1})])
 
 def test_createDataFrame_fallback_disabled(self):
+from distutils.version import LooseVersion
 import pandas as pd
+import pyarrow as pa
 
 with QuietTest(self.sc):
 with self.assertRaisesRegexp(TypeError, 'Unsupported type'):
 self.spark.createDataFrame(
 pd.DataFrame([[{u'a': 1}]]), "a: map")
 
+# TODO: remove BinaryType check once minimum pyarrow version is 
0.10.0
+if LooseVersion(pa.__version__) < LooseVersion("0.10.0"):
+with QuietTest(self.sc):
+with self.assertRaisesRegexp(TypeError, 'Unsupported 
type.*BinaryType'):
+self.spark.createDataFrame(
+pd.DataFrame([[{'a': b'aaa'}]]), "a: binary")
--- End diff --

@BryanCutler, BTW, what kind of data can be binary? `bytearray` and `bytes` 
(`str` in Python 2)?


---




[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...

2018-08-15 Thread gengliangwang
GitHub user gengliangwang opened a pull request:

https://github.com/apache/spark/pull/22119

[WIP][SPARK-25129][SQL] Revert mapping com.databricks.spark.avro to 
org.apache.spark.sql.avro

## What changes were proposed in this pull request?

In https://issues.apache.org/jira/browse/SPARK-24924, the data source 
provider com.databricks.spark.avro is mapped to the new package 
org.apache.spark.sql.avro .

Avro is an external module and is not loaded by default, so we should not prevent 
users from using "com.databricks.spark.avro".

## How was this patch tested?

Unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gengliangwang/spark revert_avro_mapping

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22119.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22119


commit 4584b2db0e39a4a67685110e35a243757a858319
Author: Gengliang Wang 
Date:   2018-08-16T04:57:04Z

Revert "[SPARK-24924][SQL] Add mapping for built-in Avro data source"

This reverts commit 58353d7f4baa8102c3d2f4777a5c407f14993306.

commit 656790e7cc05ea1fd31e768ed1a05c18b7881de6
Author: Gengliang Wang 
Date:   2018-08-16T05:42:19Z

improve error message




---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22117
  
**[Test build #94840 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94840/testReport)**
 for PR 22117 at commit 
[`3cad78f`](https://github.com/apache/spark/commit/3cad78f8bb9bc0dc841cd0c31e0b0d52f8e7c764).


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210488842
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2848,6 +2848,35 @@ setMethod("intersect",
 dataFrame(intersected)
   })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @rdname intersectAll
+#' @note intersectAll since 2.4.0
+setMethod("intersectAll",
+  signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+  function(x, y) {
+intersected <- callJMethod(x@sdf, "intersectAll", y@sdf)
+dataFrame(intersected)
+  })
--- End diff --

add extra empty line after code


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210488890
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2876,6 +2905,37 @@ setMethod("except",
 dataFrame(excepted)
   })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @rdname exceptAll
+#' @note exceptAll since 2.4.0
+setMethod("exceptAll",
+  signature(x = "SparkDataFrame", y = "SparkDataFrame"),
+  function(x, y) {
+excepted <- callJMethod(x@sdf, "exceptAll", y@sdf)
+dataFrame(excepted)
+  })
+
--- End diff --

nit: remove one of the two empty lines


---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22117
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22117
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2238/
Test PASSed.


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210488754
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2848,6 +2848,35 @@ setMethod("intersect",
 dataFrame(intersected)
   })
 
+#' intersectAll
+#'
+#' Return a new SparkDataFrame containing rows in both this SparkDataFrame
+#' and another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the intersect all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname intersectAll
+#' @name intersectAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' intersectAllDF <- intersectAll(df1, df2)
+#' }
+#' @rdname intersectAll
--- End diff --

ditto here


---




[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...

2018-08-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22107#discussion_r210488641
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -2876,6 +2905,37 @@ setMethod("except",
 dataFrame(excepted)
   })
 
+#' exceptAll
+#'
+#' Return a new SparkDataFrame containing rows in this SparkDataFrame
+#' but not in another SparkDataFrame while preserving the duplicates.
+#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in
+#' SQL, this function resolves columns by position (not by name).
+#'
+#' @param x a SparkDataFrame.
+#' @param y a SparkDataFrame.
+#' @return A SparkDataFrame containing the result of the except all 
operation.
+#' @family SparkDataFrame functions
+#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method
+#' @rdname exceptAll
+#' @name exceptAll
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' df1 <- read.json(path)
+#' df2 <- read.json(path2)
+#' exceptAllDF <- exceptAll(df1, df2)
+#' }
+#' @rdname exceptAll
--- End diff --

this is a bug in `except`; there should only be one `@rdname` for each


---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/22117
  
Jenkins, retest this please.


---




[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/22112
  
You are perfectly correct @jiangxb1987; that was a silly mistake on my part, 
and not trivial at all!
We should rely on the shuffle dependency when traversing the 
dependency tree, not on ShuffledRDD.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21320
  
@mallman, can you close this and put your efforts into 
https://github.com/apache/spark/pull/21889? I see no point in leaving this PR 
open.


---




[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/22112
  
> IMO we should traverse the dependency graph and rely on how ShuffledRDD 
is configured

A trivial point here: since `ShuffleDependency` is also a DeveloperApi, 
it's possible for users to write a customized RDD that behaves like 
`ShuffledRDD`, so we may want to depend on dependencies rather than RDDs.


---




[GitHub] spark pull request #22115: [SPARK-25082] [SQL] improve the javadoc for expm1...

2018-08-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22115


---




[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

2018-08-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/22112
  
I am not sure what the definition of `isIdempotent` here is.

For example, from MapPartitionsRDD:
```
override private[spark] def isIdempotent = {
  if (inputOrderSensitive) {
    prev.isIdempotent
  } else {
    true
  }
}
```

Consider:
`val rdd1 = rdd.groupBy().map(...).repartition(...).filter(...)`.
By the definition above, this would make rdd1 idempotent.
Depending on what the definition of idempotent is (partition level, record 
level, etc.), this can be correct or wrong code.

Similarly, I am not sure why idempotency or ordering depends on 
`Partitioner`.
IMO we should traverse the dependency graph and rely on how `ShuffledRDD` 
is configured: whether there is a key ordering specified (applies to both 
global sort and per-partition sort), whether it is from a checkpoint or marked 
for checkpoint, whether it is from a stable input source, etc.
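
A hedged sketch of that traversal (a hypothetical helper; the real checks would also need to cover checkpointing and input-source stability):

```scala
import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD

// Key the decision on how the ShuffleDependency is configured (e.g. whether
// a key ordering is specified) rather than on concrete RDD classes.
def hasOrderedShuffleAncestor(rdd: RDD[_]): Boolean =
  rdd.dependencies.exists {
    case dep: ShuffleDependency[_, _, _] => dep.keyOrdering.isDefined
    case dep                             => hasOrderedShuffleAncestor(dep.rdd)
  }
```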






---




[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22115
  
Merged to master.


---




[GitHub] spark issue #22118: Branch 2.2

2018-08-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22118
  
@speful this looks mistakenly opened. Mind closing it, please?


---




[GitHub] spark issue #22095: [SPARK-23984][K8S] Changed Python Version config to be c...

2018-08-15 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22095
  
@mccheah btw, please add a comment (say "merged to master") after you merge 
a PR - just a convention in this project. FYI. thx.


---




[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22031
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2237/
Test PASSed.


---




[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22031
  
**[Test build #94839 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94839/testReport)**
 for PR 22031 at commit 
[`16516ec`](https://github.com/apache/spark/commit/16516ec0862dab60cfbc88683233ed08f6fa5ca5).


---




[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22031
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94837/
Test PASSed.


---




[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22009
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22009
  
**[Test build #94837 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94837/testReport)**
 for PR 22009 at commit 
[`0318b4b`](https://github.com/apache/spark/commit/0318b4b1dcbfde0024945308578cedf8d4a09168).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-15 Thread heary-cao
Github user heary-cao commented on the issue:

https://github.com/apache/spark/pull/21860
  
retest this please


---




[GitHub] spark issue #22118: Branch 2.2

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22118
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22118: Branch 2.2

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22118
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22118: Branch 2.2

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22118
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20725
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94838/
Test PASSed.


---




[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20725
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22118: Branch 2.2

2018-08-15 Thread speful
GitHub user speful opened a pull request:

https://github.com/apache/spark/pull/22118

Branch 2.2

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22118.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22118


commit 86609a95af4b700e83638b7416c7e3706c2d64c6
Author: Liang-Chi Hsieh 
Date:   2017-08-08T08:12:41Z

[SPARK-21567][SQL] Dataset should work with type alias

If we create a type alias for a type workable with Dataset, the type alias 
doesn't work with Dataset.

A reproducible case looks like:

object C {
  type TwoInt = (Int, Int)
  def tupleTypeAlias: TwoInt = (1, 1)
}

Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))

It throws an exception like:

type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
  at 
scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
  ...

This patch uses the dealiased type in many places in `ScalaReflection` 
to fix it.

Added test case.
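
For reference, the dealiasing the fix relies on can be seen with plain scala-reflect (standalone sketch, not the Spark code):

```scala
import scala.reflect.runtime.universe._

object C { type TwoInt = (Int, Int) }

val aliased = typeOf[C.TwoInt] // reports the alias C.TwoInt
val real    = aliased.dealias  // (Int, Int), a type ScalaReflection can resolve
```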

Author: Liang-Chi Hsieh 

Closes #18813 from viirya/SPARK-21567.

(cherry picked from commit ee1304199bcd9c1d5fc94f5b06fdd5f6fe7336a1)
Signed-off-by: Wenchen Fan 

commit e87ffcaa3e5b75f8d313dc995e4801063b60cd5c
Author: Wenchen Fan 
Date:   2017-08-08T08:32:49Z

Revert "[SPARK-21567][SQL] Dataset should work with type alias"

This reverts commit 86609a95af4b700e83638b7416c7e3706c2d64c6.

commit d0233145208eb6afcd9fe0c1c3a9dbbd35d7727e
Author: pgandhi 
Date:   2017-08-09T05:46:06Z

[SPARK-21503][UI] Spark UI shows incorrect task status for a killed 
Executor Process

The executor tab on the Spark UI page shows a task as completed when the executor 
process running that task is killed using the kill command.
Added the case ExecutorLostFailure, which was previously missing; without it, 
the default case was executed and the task was marked as 
completed. This case covers all those situations where the executor's connection to 
the Spark Driver was lost due to killing the executor process, a network failure, 
etc.

## How was this patch tested?
Manually tested the fix by observing the UI change before and after.
Before:
https://user-images.githubusercontent.com/8190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png
After:
https://user-images.githubusercontent.com/8190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png

Please review http://spark.apache.org/contributing.html before opening a 
pull request.

Author: pgandhi 
Author: pgandhi999 

Closes #18707 from pgandhi999/master.

(cherry picked from commit f016f5c8f6c6aae674e9905a5c0b0bede09163a4)
Signed-off-by: Wenchen Fan 

commit 7446be3328ea75a5197b2587e3a8e2ca7977726b
Author: WeichenXu 
Date:   2017-08-09T06:44:10Z

[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong 
wolfe line search

## What changes were proposed in this pull request?

Update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
https://github.com/scalanlp/breeze/pull/651

## How was this patch tested?

N/A

Author: WeichenXu 

Closes #18797 from WeichenXu123/update-breeze.

(cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d)
Signed-off-by: Yanbo Liang 

commit f6d56d2f1c377000921effea2b1faae15f9cae82
Author: Shixiong Zhu 
Date:   2017-08-09T06:49:33Z

[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the 
return value

Same PR as #18799 but for branch 2.2. Main discussion the other PR.


When I was investigating a flaky test, I realized that many places don't 
check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When 
a batch is supposed to be there, the caller just ignores None rather than 
throwing an error. If some bug causes a query doesn't generate a batch metadata 
file, this behavior will hide it and allow the query continuing to run and 
finally delete metadata logs and make it hard to debug.

This PR ensures that places calling HDFSMetadataLog.get always check the
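
The enforced pattern looks roughly like this (sketch; `metadataLog` and the message text are illustrative):

```scala
// Fail loudly when an expected batch is missing instead of ignoring None.
val batch = metadataLog.get(batchId).getOrElse {
  throw new IllegalStateException(s"batch $batchId doesn't exist")
}
```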

[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20725
  
**[Test build #94838 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94838/testReport)**
 for PR 20725 at commit 
[`461c326`](https://github.com/apache/spark/commit/461c326f00d68a350a1b5c0f7b644f2871ee0a85).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22117
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94834/
Test FAILed.


---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22117
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22117
  
**[Test build #94834 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94834/testReport)**
 for PR 22117 at commit 
[`3cad78f`](https://github.com/apache/spark/commit/3cad78f8bb9bc0dc841cd0c31e0b0d52f8e7c764).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20725
  
This is working now, and BinaryType support is conditional on pyarrow 
0.10.0 or higher being used. @HyukjinKwon @cloud-fan what are your thoughts on 
getting this in for Spark 2.4? I think it would be very useful to have, since 
images in Spark use BinaryType, and it will be good to have when integrating 
Spark with DL frameworks.


---




[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20725
  
**[Test build #94838 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94838/testReport)**
 for PR 20725 at commit 
[`461c326`](https://github.com/apache/spark/commit/461c326f00d68a350a1b5c0f7b644f2871ee0a85).


---




[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20725
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...

2018-08-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21537
  
We have been fully swamped by the hotfixes and regressions of the 2.3 release and the 
new features targeting 2.4. We should have posted some comments in this PR 
earlier. 

Designing an IR for our codegen is the right thing to do. [If you do 
not agree on this, we can discuss it.] How to design an IR is a 
challenging task. The whole community is welcome to submit designs and PRs. 
Everyone can show their ideas; the best idea will win. @HyukjinKwon if you have 
bandwidth, please also have a try.


---




[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20725
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2236/
Test PASSed.


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21860
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21860
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94835/
Test FAILed.


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21860
  
**[Test build #94835 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94835/testReport)**
 for PR 21860 at commit 
[`6ff46d9`](https://github.com/apache/spark/commit/6ff46d941a6ddb29345ea0c563aa68b77f540139).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...

2018-08-15 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21912
  
cc @ueshin


---




[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22031
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94833/
Test PASSed.


---




[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22031
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22031
  
**[Test build #94833 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94833/testReport)**
 for PR 22031 at commit 
[`0342ed9`](https://github.com/apache/spark/commit/0342ed934e65c13c43081f464503800118383a44).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files

2018-08-15 Thread raofu
Github user raofu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22113#discussion_r210473687
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala ---
@@ -70,7 +70,7 @@ private[hive] object OrcFileOperator extends Logging {
   hdfsPath.getFileSystem(conf)
 }
 
-listOrcFiles(basePath, conf).iterator.map { path =>
+listOrcFiles(basePath, conf).view.map { path =>
--- End diff --

My bad. I misread the code. Sorry about the noise.


---




[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files

2018-08-15 Thread raofu
Github user raofu closed the pull request at:

https://github.com/apache/spark/pull/22113


---




[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22115
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94831/
Test PASSed.


---




[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()

2018-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22115
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()

2018-08-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22115
  
**[Test build #94831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94831/testReport)** for PR 22115 at commit [`089c31f`](https://github.com/apache/spark/commit/089c31fcff1a5b84634f5de78c1bd440f738b2f4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22045#discussion_r210469510
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -497,6 +497,53 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Returns a map that applies the function to each value of the map.
+ */
+@ExpressionDescription(
+usage = "_FUNC_(expr, func) - Transforms values in the map using the 
function.",
+examples = """
+Examples:
+   > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> v + 
1);
+map(array(1, 2, 3), array(2, 3, 4))
+   > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 
v);
+map(array(1, 2, 3), array(2, 4, 6))
+  """,
+since = "2.4.0")
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22045#discussion_r210469494
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -497,6 +497,53 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Returns a map that applies the function to each value of the map.
+ */
+@ExpressionDescription(
+usage = "_FUNC_(expr, func) - Transforms values in the map using the 
function.",
+examples = """
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22045#discussion_r210471011
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -2302,6 +2302,177 @@ class DataFrameFunctionsSuite extends QueryTest 
with SharedSQLContext {
 assert(ex5.getMessage.contains("function map_zip_with does not support 
ordering on type map"))
   }
 
+  test("transform values function - test primitive data types") {
+val dfExample1 = Seq(
+  Map[Int, Int](1 -> 1, 9 -> 9, 8 -> 8, 7 -> 7)
+).toDF("i")
+
+val dfExample2 = Seq(
+  Map[Boolean, String](false -> "abc", true -> "def")
+).toDF("x")
+
+val dfExample3 = Seq(
+  Map[String, Int]("a" -> 1, "b" -> 2, "c" -> 3)
+).toDF("y")
+
+val dfExample4 = Seq(
+  Map[Int, Double](1 -> 1.0, 2 -> 1.40, 3 -> 1.70)
+).toDF("z")
+
+val dfExample5 = Seq(
+  Map[Int, Array[Int]](1 -> Array(1, 2))
+).toDF("c")
+
+def testMapOfPrimitiveTypesCombination(): Unit = {
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> k + 
v)"),
+Seq(Row(Map(1 -> 2, 9 -> 18, 8 -> 16, 7 -> 14))))
+
+  checkAnswer(dfExample2.selectExpr(
+"transform_values(x, (k, v) -> if(k, v, CAST(k AS String)))"),
+Seq(Row(Map(false -> "false", true -> "def"))))
+
+  checkAnswer(dfExample2.selectExpr("transform_values(x, (k, v) -> NOT 
k AND v = 'abc')"),
+Seq(Row(Map(false -> true, true -> false))))
+
+  checkAnswer(dfExample3.selectExpr("transform_values(y, (k, v) -> v * 
v)"),
+Seq(Row(Map("a" -> 1, "b" -> 4, "c" -> 9))))
+
+  checkAnswer(dfExample3.selectExpr(
+"transform_values(y, (k, v) -> k || ':' || CAST(v as String))"),
+Seq(Row(Map("a" -> "a:1", "b" -> "b:2", "c" -> "c:3"))))
+
+  checkAnswer(
+dfExample3.selectExpr("transform_values(y, (k, v) -> concat(k, 
cast(v as String)))"),
+Seq(Row(Map("a" -> "a1", "b" -> "b2", "c" -> "c3"))))
+
+  checkAnswer(
+dfExample4.selectExpr(
+  "transform_values(" +
+"z,(k, v) -> map_from_arrays(ARRAY(1, 2, 3), " +
+"ARRAY('one', 'two', 'three'))[k] || '_' || CAST(v AS 
String))"),
+Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 -> "three_1.7"))))
+
+  checkAnswer(
+dfExample4.selectExpr("transform_values(z, (k, v) -> k-v)"),
+Seq(Row(Map(1 -> 0.0, 2 -> 0.6000000000000001, 3 -> 1.3))))
+
+  checkAnswer(
+dfExample5.selectExpr("transform_values(c, (k, v) -> k + 
cardinality(v))"),
+Seq(Row(Map(1 -> 3))))
+}
+
+// Test with local relation, the Project will be evaluated without 
codegen
+testMapOfPrimitiveTypesCombination()
+dfExample1.cache()
+dfExample2.cache()
+dfExample3.cache()
+dfExample4.cache()
+dfExample5.cache()
+// Test with cached relation, the Project will be evaluated with 
codegen
+testMapOfPrimitiveTypesCombination()
+  }
+
+  test("transform values function - test empty") {
+val dfExample1 = Seq(
+  Map.empty[Integer, Integer]
+).toDF("i")
+
+val dfExample2 = Seq(
+  Map.empty[BigInt, String]
+).toDF("j")
+
+def testEmpty(): Unit = {
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 
NULL)"),
+Seq(Row(Map.empty[Integer, Integer])))
+
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 
k)"),
+Seq(Row(Map.empty[Integer, Integer])))
+
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 
v)"),
+Seq(Row(Map.empty[Integer, Integer])))
+
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 
0)"),
+Seq(Row(Map.empty[Integer, Integer])))
+
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 
'value')"),
+Seq(Row(Map.empty[Integer, String])))
+
+  checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 
true)"),
+Seq(Row(Map.empty[Integer, Boolean])))
+
+  checkAnswer(dfExample2.selectExpr("transform_values(j, (k, v) -> k + 
cast(v as BIGINT))"),
+Seq(Row(Map.empty[BigInt, BigInt])))
+}
+
+testEmpty()
+dfExample1.cache()
+dfExample2.cache()
+testEmpty()
+  }
+
+  test("transform values function - test null values") {
+val dfExample1 = Seq(
+  Map[Int, Integer](1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)
+).toDF("a")
+
+val dfExample2 = Seq(
+  Map[Int, String](1 -> "a", 2 -> "b", 3 -> null)
 

[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22045#discussion_r210469472
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -497,6 +497,53 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Returns a map that applies the function to each value of the map.
+ */
+@ExpressionDescription(
+usage = "_FUNC_(expr, func) - Transforms values in the map using the 
function.",
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22045#discussion_r210470513
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -497,6 +497,53 @@ case class ArrayAggregate(
   override def prettyName: String = "aggregate"
 }
 
+/**
+ * Returns a map that applies the function to each value of the map.
+ */
+@ExpressionDescription(
+usage = "_FUNC_(expr, func) - Transforms values in the map using the 
function.",
+examples = """
+Examples:
+   > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> v + 
1);
+map(array(1, 2, 3), array(2, 3, 4))
+   > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + 
v);
+map(array(1, 2, 3), array(2, 4, 6))
+  """,
+since = "2.4.0")
+case class TransformValues(
+argument: Expression,
+function: Expression)
+  extends MapBasedSimpleHigherOrderFunction with CodegenFallback {
+
+  override def nullable: Boolean = argument.nullable
+
+  @transient lazy val MapType(keyType, valueType, valueContainsNull) = 
argument.dataType
+
+  override def dataType: DataType = MapType(keyType, function.dataType, 
valueContainsNull)
+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => 
LambdaFunction)
+  : TransformValues = {
+copy(function = f(function, (keyType, false) :: (valueType, 
valueContainsNull) :: Nil))
+  }
+
+  @transient lazy val LambdaFunction(
+  _, (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: 
Nil, _) = function
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22109: [SPARK-25120][CORE][HistoryServer]Fix the problem of Eve...

2018-08-15 Thread deshanxiao
Github user deshanxiao commented on the issue:

https://github.com/apache/spark/pull/22109
  
@vanzin Sorry, SPARK-22850 has already fixed the problem. Maybe I will track the 
executor loss problem next. Thank you!  @vanzin @squito 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210467354
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -442,3 +442,91 @@ case class ArrayAggregate(
 
   override def prettyName: String = "aggregate"
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(left, right, func) - Merges the two given arrays, 
element-wise, into a single array using function. If one array is shorter, 
nulls are appended at the end to match the length of the longer array, before 
applying function.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, 
x));
+   array(('a', 1), ('b', 3), ('c', 5))
+  > SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y));
+   array(4, 6)
+  > SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) 
-> concat(x, y));
+   array('ad', 'be', 'cf')
+  """,
+  since = "2.4.0")
+// scalastyle:on line.size.limit
+case class ArraysZipWith(
+left: Expression,
+right: Expression,
+function: Expression)
+  extends HigherOrderFunction with CodegenFallback with ExpectsInputTypes {
+
+  override def inputs: Seq[Expression] = List(left, right)
+
+  override def functions: Seq[Expression] = List(function)
+
+  def expectingFunctionType: AbstractDataType = AnyDataType
+  @transient lazy val functionForEval: Expression = functionsForEval.head
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType, 
ArrayType, expectingFunctionType)
+
+  override def nullable: Boolean = inputs.exists(_.nullable)
+
+  override def dataType: ArrayType = ArrayType(function.dataType, 
function.nullable)
+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => 
LambdaFunction): ArraysZipWith = {
+val (leftElementType, leftContainsNull) = left.dataType match {
+  case ArrayType(elementType, containsNull) => (elementType, 
containsNull)
+  case _ =>
+val ArrayType(elementType, containsNull) = 
ArrayType.defaultConcreteType
+(elementType, containsNull)
+}
+val (rightElementType, rightContainsNull) = right.dataType match {
+  case ArrayType(elementType, containsNull) => (elementType, 
containsNull)
+  case _ =>
+val ArrayType(elementType, containsNull) = 
ArrayType.defaultConcreteType
+(elementType, containsNull)
+}
+copy(function = f(function,
+  (leftElementType, leftContainsNull) :: (rightElementType, 
rightContainsNull) :: Nil))
--- End diff --

If we append `null`s to the shorter array, both of the arguments might be 
`null`, so we should use `true` for nullabilities of the arguments as @mn-mikke 
suggested.
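
A quick sketch of the padding behaviour behind this point (hypothetical spark-shell session; assumes the `zip_with` SQL function from this PR and a running `spark` session):

```scala
// Neither input array contains a null, but the shorter one is padded with
// nulls before the lambda is applied, so y is null for indices 1 and 2.
val df = spark.sql("SELECT zip_with(array(1, 2, 3), array(10), (x, y) -> x + y)")
df.collect() // expected: one row holding [11, null, null], because x + y
             // evaluates to null whenever y is the padded null
```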


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210467721
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala
 ---
@@ -396,4 +396,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
   map_zip_with(mbb0, mbbn, concat),
   null)
   }
+
+  test("ZipWith") {
+def zip_with(
+left: Expression,
+right: Expression,
+f: (Expression, Expression) => Expression): Expression = {
+  val ArrayType(leftT, leftContainsNull) = 
left.dataType.asInstanceOf[ArrayType]
+  val ArrayType(rightT, rightContainsNull) = 
right.dataType.asInstanceOf[ArrayType]
+  ZipWith(left, right, createLambda(leftT, leftContainsNull, rightT, 
rightContainsNull, f))
+}
+
+val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, 
containsNull = false))
+val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, 
containsNull = false))
--- End diff --

What's the difference between `ai0` and `ai1`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210467959
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala
 ---
@@ -396,4 +396,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
   map_zip_with(mbb0, mbbn, concat),
   null)
   }
+
+  test("ZipWith") {
+def zip_with(
+left: Expression,
+right: Expression,
+f: (Expression, Expression) => Expression): Expression = {
+  val ArrayType(leftT, leftContainsNull) = 
left.dataType.asInstanceOf[ArrayType]
+  val ArrayType(rightT, rightContainsNull) = 
right.dataType.asInstanceOf[ArrayType]
+  ZipWith(left, right, createLambda(leftT, leftContainsNull, rightT, 
rightContainsNull, f))
+}
+
+val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, 
containsNull = false))
+val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, 
containsNull = false))
+val ai2 = Literal.create(Seq[Integer](1, null, 3), 
ArrayType(IntegerType, containsNull = true))
+val ai3 = Literal.create(Seq[Integer](1, null), ArrayType(IntegerType, 
containsNull = true))
+val ain = Literal.create(null, ArrayType(IntegerType, containsNull = 
false))
+
+val add: (Expression, Expression) => Expression = (x, y) => x + y
+val plusOne: Expression => Expression = x => x + 1
+
+checkEvaluation(zip_with(ai0, ai1, add), Seq(2, 4, 6))
+checkEvaluation(zip_with(ai3, ai2, add), Seq(2, null, null))
+checkEvaluation(zip_with(ai2, ai3, add), Seq(2, null, null))
+checkEvaluation(zip_with(ain, ain, add), null)
+checkEvaluation(zip_with(ai1, ain, add), null)
+checkEvaluation(zip_with(ain, ai1, add), null)
+
+val as0 = Literal.create(Seq("a", "b", "c"), ArrayType(StringType, 
containsNull = false))
+val as1 = Literal.create(Seq("a", null, "c"), ArrayType(StringType, 
containsNull = true))
+val as2 = Literal.create(Seq("a"), ArrayType(StringType, containsNull 
= true))
+val asn = Literal.create(null, ArrayType(StringType, containsNull = 
false))
+
+val concat: (Expression, Expression) => Expression = (x, y) => 
Concat(Seq(x, y))
+
+checkEvaluation(zip_with(as0, as1, concat), Seq("aa", null, "cc"))
+checkEvaluation(zip_with(as0, as2, concat), Seq("aa", null, null))
+
+val aai1 = Literal.create(Seq(Seq(1, 2, 3), null, Seq(4, 5)),
+  ArrayType(ArrayType(IntegerType, containsNull = false), containsNull 
= true))
+val aai2 = Literal.create(Seq(Seq(1, 2, 3)),
+  ArrayType(ArrayType(IntegerType, containsNull = false), containsNull 
= true))
+checkEvaluation(
+  zip_with(aai1, aai2, (a1, a2) =>
+  Cast(zip_with(transform(a1, plusOne), transform(a2, plusOne), 
add), StringType)),
--- End diff --

nit: indent


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210468814
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/higher-order-functions.sql ---
@@ -51,3 +51,12 @@ select exists(ys, y -> y > 30) as v from nested;
 
 -- Check for element existence in a null array
 select exists(cast(null as array), y -> y > 30) as v;
+
+-- Zip with array
+select zip_with(ys, zs, (a, b) -> a + size(b)) as v from nested;
+
+-- Zip with array with concat
+select zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> 
concat(x, y)) as v;
+
+-- Zip with array coalesce
+select zip_with(array('a'), array('d', null, 'f'), (x, y) -> coalesce(x, 
y)) as v;
--- End diff --

Can you add a line break at the end of the file?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210468640
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -2302,6 +2302,76 @@ class DataFrameFunctionsSuite extends QueryTest with 
SharedSQLContext {
 assert(ex5.getMessage.contains("function map_zip_with does not support 
ordering on type map"))
   }
 
+  test("arrays zip_with function - for primitive types") {
+val df1 = Seq[(Seq[Integer], Seq[Integer])](
+  (Seq(9001, 9002, 9003), Seq(4, 5, 6)),
+  (Seq(1, 2), Seq(3, 4)),
+  (Seq.empty, Seq.empty),
+  (null, null)
+).toDF("val1", "val2")
+val df2 = Seq[(Seq[Integer], Seq[Long])](
+  (Seq(1, null, 3), Seq(1L, 2L)),
+  (Seq(1, 2, 3), Seq(4L, 11L))
+).toDF("val1", "val2")
+
+val expectedValue1 = Seq(
+  Row(Seq(9005, 9007, 9009)),
+  Row(Seq(4, 6)),
+  Row(Seq.empty),
+  Row(null))
+checkAnswer(df1.selectExpr("zip_with(val1, val2, (x, y) -> x + y)"), 
expectedValue1)
+
+val expectedValue2 = Seq(
+  Row(Seq(Row(1.0, 1), Row(2.0, null), Row(null, 3))),
--- End diff --

Why `1.0` or `2.0` instead of `1L` or `2L`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210466854
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala
 ---
@@ -687,3 +687,89 @@ case class MapZipWith(left: Expression, right: 
Expression, function: Expression)
 
   override def prettyName: String = "map_zip_with"
 }
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(left, right, func) - Merges the two given arrays, 
element-wise, into a single array using function. If one array is shorter, 
nulls are appended at the end to match the length of the longer array, before 
applying function.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, 
x));
+   array(('a', 1), ('b', 3), ('c', 5))
+  > SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y));
+   array(4, 6)
+  > SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) 
-> concat(x, y));
+   array('ad', 'be', 'cf')
+  """,
+  since = "2.4.0")
+// scalastyle:on line.size.limit
+case class ZipWith(left: Expression, right: Expression, function: 
Expression)
+  extends HigherOrderFunction with CodegenFallback {
+
+  def functionForEval: Expression = functionsForEval.head
+
+  override def arguments: Seq[Expression] = left :: right :: Nil
+
+  override def argumentTypes: Seq[AbstractDataType] = ArrayType :: 
ArrayType :: Nil
+
+  override def functions: Seq[Expression] = List(function)
+
+  override def functionTypes: Seq[AbstractDataType] = AnyDataType :: Nil
+
+  override def nullable: Boolean = left.nullable || right.nullable
+
+  override def dataType: ArrayType = ArrayType(function.dataType, 
function.nullable)
+
+  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => 
LambdaFunction): ZipWith = {
+val (leftElementType, leftContainsNull) = left.dataType match {
+  case ArrayType(elementType, containsNull) => (elementType, 
containsNull)
+  case _ =>
+val ArrayType(elementType, containsNull) = 
ArrayType.defaultConcreteType
+(elementType, containsNull)
+}
+val (rightElementType, rightContainsNull) = right.dataType match {
+  case ArrayType(elementType, containsNull) => (elementType, 
containsNull)
+  case _ =>
+val ArrayType(elementType, containsNull) = 
ArrayType.defaultConcreteType
+(elementType, containsNull)
+}
--- End diff --

Now we can do:

```scala
val ArrayType(leftElementType, leftContainsNull) = left.dataType
val ArrayType(rightElementType, rightContainsNull) = right.dataType
```
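
Folding this into `bind`, together with the `true` nullabilities from the padding discussion above, would give roughly the sketch below (the suggested shape, not the code in this revision):

```scala
override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ZipWith = {
  // argumentTypes already forces both children to ArrayType during analysis,
  // so these extractor patterns cannot fail at bind time.
  val ArrayType(leftElementType, _) = left.dataType
  val ArrayType(rightElementType, _) = right.dataType
  // Either lambda argument can be the padded null, hence nullable = true for
  // both, regardless of the input arrays' containsNull flags.
  copy(function = f(function,
    (leftElementType, true) :: (rightElementType, true) :: Nil))
}
```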



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with

2018-08-15 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22031#discussion_r210467535
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala
 ---
@@ -396,4 +396,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
   map_zip_with(mbb0, mbbn, concat),
   null)
   }
+
+  test("ZipWith") {
+def zip_with(
+left: Expression,
+right: Expression,
+f: (Expression, Expression) => Expression): Expression = {
+  val ArrayType(leftT, leftContainsNull) = 
left.dataType.asInstanceOf[ArrayType]
+  val ArrayType(rightT, rightContainsNull) = 
right.dataType.asInstanceOf[ArrayType]
--- End diff --

nit: we don't need `.asInstanceOf[ArrayType]`?
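
i.e. the `ArrayType` extractor both matches and destructures, so a sketch of the simplified test helper:

```scala
def zip_with(
    left: Expression,
    right: Expression,
    f: (Expression, Expression) => Expression): Expression = {
  // The pattern match performs the type check itself; no cast required.
  val ArrayType(leftT, leftContainsNull) = left.dataType
  val ArrayType(rightT, rightContainsNull) = right.dataType
  ZipWith(left, right, createLambda(leftT, leftContainsNull, rightT, rightContainsNull, f))
}
```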


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21561#discussion_r210468884
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private (
 new BisectingKMeansModel(root, this.distanceMeasure)
   }
 
+  /**
+   * Runs the bisecting k-means algorithm.
+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")
--- End diff --

Oh right I get it now, this isn't a new method, it's 'replacing' the 
definition above. 👍 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22116: [DOCS]Update configuration.md

2018-08-15 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22116
  
Huh OK I thought I looked and this had been fixed. Good catch. Also there's 
an instance in `cloud-integration.md`, worth fixing too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/21561#discussion_r210468639
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private (
 new BisectingKMeansModel(root, this.distanceMeasure)
   }
 
+  /**
+   * Runs the bisecting k-means algorithm.
+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")
--- End diff --

`def run(input: RDD[Vector]): BisectingKMeansModel` has been a public API since 
1.6, and users can call it.
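
For context, a minimal sketch of the pattern under discussion (a stand-in class, not the real Spark source; the method body is a placeholder):

```scala
import org.apache.spark.annotation.Since
import org.apache.spark.mllib.clustering.BisectingKMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

class BisectingKMeansSketch {
  /**
   * Runs the bisecting k-means algorithm.
   * @param input RDD of vectors
   * @return model for the bisecting kmeans
   */
  @Since("1.6.0") // `run` has been callable by users since 1.6.0; moving or
                  // refactoring the body does not change when the API appeared
  def run(input: RDD[Vector]): BisectingKMeansModel = {
    throw new UnsupportedOperationException("sketch only")
  }
}
```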


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/21561#discussion_r210468107
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private (
 new BisectingKMeansModel(root, this.distanceMeasure)
   }
 
+  /**
+   * Runs the bisecting k-means algorithm.
+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")
--- End diff --

You couldn't call `BisectingKMeans.run(...)` before this, right? It wasn't 
in a superclass or anything. In that sense I think this method needs to be 
marked as new as of 2.4.0, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...

2018-08-15 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/21469
  
@tdas Kind reminder.
@zsxwing Could you take a quick look at this and share your thoughts? I 
think the patch is ready to merge, but it is blocked by a slight difference of 
views, so more voices would be better.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...

2018-08-15 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/21561#discussion_r210467653
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -246,6 +245,16 @@ class BisectingKMeans private (
 new BisectingKMeansModel(root, this.distanceMeasure)
   }
 
+  /**
+   * Runs the bisecting k-means algorithm.
+   * @param input RDD of vectors
+   * @return model for the bisecting kmeans
+   */
+  @Since("1.6.0")
--- End diff --

This API has existed since 1.6.0, so we should keep the `@Since` 
annotation, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...

2018-08-15 Thread HeartSaVioR
Github user HeartSaVioR commented on the issue:

https://github.com/apache/spark/pull/21733
  
@tdas Kind reminder.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()

2018-08-15 Thread bomeng
Github user bomeng commented on the issue:

https://github.com/apache/spark/pull/22115
  
I have already done a global search. That is the only place that needs a change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


