[GitHub] spark issue #21221: [SPARK-23429][CORE] Add executor memory metrics to heart...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/21221 Jenkins, retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r210492311 --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala --- @@ -216,8 +217,7 @@ private[spark] class Executor( def stop(): Unit = { env.metricsSystem.report() -heartbeater.shutdown() -heartbeater.awaitTermination(10, TimeUnit.SECONDS) +heartbeater.stop() --- End diff -- future: `try {} catch { case NonFatal(e)`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21221: [SPARK-23429][CORE] Add executor memory metrics t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21221#discussion_r210492513 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -69,6 +69,11 @@ package object config { .bytesConf(ByteUnit.KiB) .createWithDefaultString("100k") + private[spark] val EVENT_LOG_STAGE_EXECUTOR_METRICS = +ConfigBuilder("spark.eventLog.logStageExecutorMetrics.enabled") + .booleanConf + .createWithDefault(true) --- End diff -- should this be "false" for now until we could test this out more, just to be on the safe side? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22110: [SPARK-25122][SQL] Deduplication of supports equa...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/22110#discussion_r210493260 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TypeUtils.scala --- @@ -73,4 +73,14 @@ object TypeUtils { } x.length - y.length } + + /** + * Returns true if elements of the data type could be used as items of a hash set or as keys + * of a hash map. + */ + def typeCanBeHashed(dataType: DataType): Boolean = dataType match { --- End diff -- I will change it :) Just one question to ```hashCode```. If ```case classes``` are used, ```equals``` and ```hashCode``` are generated by compiler. But if we define ```equals``` manually, shouldn't also hold ```a.equals(b) == true``` => ```a.hashCode == b.hashCode```? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21537 For 2. and 3., it is harder to say my opinion in the comment. Let me say short comments at first. For 2., if I remember correctly, @viirya once wrote the API document in a JIRA entry. it would be good to update and add some thoughts about design as a first step. I understand that it is hard to keep the up-to-date design document, in particular, during the open-source development. This is because we have a lot of excellent comments in a PR. For 3., at the early implementation of SQL codegen (i.e. use `s""` to represent code), I thought there are two problems. 1. lack of type of an expression (in other words, `ExprCode` did not have the type of `value`) 2. lack of a structure of statements Now, we meet a problem that it is hard to rewrite a method argument due to problem 1. In my understanding, the current effort led by @viirya is trying to resolve problem 1. It is hard to rewrite a set of statements due to Problem 2. To resolve problem 2, we need more effort. In my opinion, we are addressing them step by step. Of course, it would be happy for me to co-lead a discussion of the IR design for 2. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22119 CSV is loaded by default, while AVRO is not. So having a backward compatibility mapping in CSV only still makes sense. Let's remove the mapping for CSV in 3.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21537 If there's a bug, then let's fix in another JIRA. If that's impossible to fix or sounds super risky and there's something I missed, let's revert. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21537 I wouldn't revert this unless there are specific concerns about this. Do you see any bug by a mixture of representation `s""` and `code""`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22119 For details, see the discussion in the JIRA https://issues.apache.org/jira/browse/SPARK-24924 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22119 If we all agree this databricks mapping is not reasonable, I think it's ok to have this inconsistency and remove the mapping for CSV in 3.0. It's weird to make the same mistake just to make things consistent. (again, if we agree this is a mistake) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22111: [SPARK-25123][SQL] Use Block to track code in SimpleExpr...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22111 Let us hold these codegen PRs until we see the design doc for building IR for the codegen? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22119#discussion_r210491239 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -637,6 +635,12 @@ object DataSource extends Logging { "Hive built-in ORC data source must be used with Hive support enabled. " + "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " + "'native'") +} else if (provider1.toLowerCase(Locale.ROOT) == "avro" || --- End diff -- do we have the same check for kafka? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22098: [SPARK-24886][INFRA] Fix the testing script to increase ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22098 @shaneknapp, seems this was first introduced in https://issues.apache.org/jira/browse/SPARK-3076 / https://github.com/apache/spark/pull/1974 for a good reason fwiw. One thing I am not sure is if we need to have env or not .. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22119#discussion_r210491110 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroSuite.scala --- @@ -503,7 +495,7 @@ class AvroSuite extends QueryTest with SharedSQLContext with SQLTestUtils { // get the same values back. withTempPath { tempDir => val name = "AvroTest" - val namespace = "org.apache.spark.avro" + val namespace = "com.databricks.spark.avro" --- End diff -- why change the name space in test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21889 **[Test build #4278 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4278/testReport)** for PR 21889 at commit [`8d822ee`](https://github.com/apache/spark/commit/8d822eea805e1b2dc40b866ca8ac4893e53ad51b). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #4277 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4277/testReport)** for PR 21320 at commit [`0e5594b`](https://github.com/apache/spark/commit/0e5594b6ac1dcb94f3f0166e66a7d4e7eae3d00c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21537 Thank for involving me in an important thread. I was busy this morning in Japan. I think there are three topics in the thread. 1. Merge or revert this PR 2. Design document 3. IR design For 1., in short, my opinion is likely to revert this PR from the view like a release manager. As we know, it is a time to cut a release branch. This PR partially introduce a new representation. If there would be a bug in code generation at Spark 2.4, it may introduce a complication of debugging and fixing. As @mgaido91 pointed out, #20043 and #21026 have been merged. I think that they are a kind of refactoring (e.g. change the representation of literal, class, and so on). Thus, these two PRs can be there. However, this PR introduces a mixture of representation `s""` and `code""`. WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210490145 --- Diff: R/pkg/R/DataFrame.R --- @@ -2848,6 +2848,35 @@ setMethod("intersect", dataFrame(intersected) }) +#' intersectAll +#' +#' Return a new SparkDataFrame containing rows in both this SparkDataFrame +#' and another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the intersect all operation. +#' @family SparkDataFrame functions +#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method +#' @rdname intersectAll +#' @name intersectAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' intersectAllDF <- intersectAll(df1, df2) +#' } +#' @rdname intersectAll +#' @note intersectAll since 2.4.0 +setMethod("intersectAll", + signature(x = "SparkDataFrame", y = "SparkDataFrame"), + function(x, y) { +intersected <- callJMethod(x@sdf, "intersectAll", y@sdf) +dataFrame(intersected) + }) --- End diff -- @felixcheung OK. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210490166 --- Diff: R/pkg/R/DataFrame.R --- @@ -2876,6 +2905,37 @@ setMethod("except", dataFrame(excepted) }) +#' exceptAll +#' +#' Return a new SparkDataFrame containing rows in this SparkDataFrame +#' but not in another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the except all operation. +#' @family SparkDataFrame functions +#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method +#' @rdname exceptAll +#' @name exceptAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' exceptAllDF <- exceptAll(df1, df2) +#' } +#' @rdname exceptAll +#' @note exceptAll since 2.4.0 +setMethod("exceptAll", + signature(x = "SparkDataFrame", y = "SparkDataFrame"), + function(x, y) { +excepted <- callJMethod(x@sdf, "exceptAll", y@sdf) +dataFrame(excepted) + }) + --- End diff -- @felixcheung Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22119 Sorry if I missed some comments somewhere but just for clarification, should we do it for CSV in 3.0.0? Inconsistency should also be taken into account. Actually configuration sounds making more sense to me and remove it in 3.0.0. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210490074 --- Diff: R/pkg/R/DataFrame.R --- @@ -2876,6 +2905,37 @@ setMethod("except", dataFrame(excepted) }) +#' exceptAll +#' +#' Return a new SparkDataFrame containing rows in this SparkDataFrame +#' but not in another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the except all operation. +#' @family SparkDataFrame functions +#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method +#' @rdname exceptAll +#' @name exceptAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' exceptAllDF <- exceptAll(df1, df2) +#' } +#' @rdname exceptAll --- End diff -- @felixcheung Thanks .. Did you want the original function `except` fixed at part of this ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r210489980 --- Diff: R/pkg/R/functions.R --- @@ -3320,7 +3321,7 @@ setMethod("explode", #' @aliases sequence sequence,Column-method #' @note sequence since 2.4.0 setMethod("sequence", - signature(x = "Column", y = "Column"), + signature(), --- End diff -- sorry, I didn't see the reply. yes, we should try to make sequence callable. we shouldn't have to manually call it though and it is better to rely on R internal type/call routing. it's a bit hard to explain but check out `attach` `setGeneric("attach")` or `str` `setGeneric("str")` if you see what I mean. also we should avoid `signature()` empty as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22119 **[Test build #94841 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94841/testReport)** for PR 22119 at commit [`656790e`](https://github.com/apache/spark/commit/656790e7cc05ea1fd31e768ed1a05c18b7881de6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22119 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22119 @tgravescs @dongjoon-hyun @HyukjinKwon @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22119#discussion_r210489530 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -637,6 +635,12 @@ object DataSource extends Logging { "Hive built-in ORC data source must be used with Hive support enabled. " + "Please use the native ORC data source by setting 'spark.sql.orc.impl' to " + "'native'") +} else if (provider1.toLowerCase(Locale.ROOT) == "avro" || + provider1 == "com.databricks.spark.avro") { + throw new AnalysisException( +s"Failed to find data source: ${provider1.toLowerCase(Locale.ROOT)}. " + +"AVRO is built-in data source since Spark 2.4. Please deploy the application " + +"as per https://spark.apache.org/docs/latest/avro-data-source.html#deploying";) --- End diff -- I am creating a documentation for AVRO data source. Let's merge this PR after the README is done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databricks.sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22119 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2239/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20725: [SPARK-23555][PYTHON] Add BinaryType support for ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20725#discussion_r210489614 --- Diff: python/pyspark/sql/tests.py --- @@ -4331,13 +4354,22 @@ def test_createDataFrame_fallback_enabled(self): self.assertEqual(df.collect(), [Row(a={u'a': 1})]) def test_createDataFrame_fallback_disabled(self): +from distutils.version import LooseVersion import pandas as pd +import pyarrow as pa with QuietTest(self.sc): with self.assertRaisesRegexp(TypeError, 'Unsupported type'): self.spark.createDataFrame( pd.DataFrame([[{u'a': 1}]]), "a: map") +# TODO: remove BinaryType check once minimum pyarrow version is 0.10.0 +if LooseVersion(pa.__version__) < LooseVersion("0.10.0"): +with QuietTest(self.sc): +with self.assertRaisesRegexp(TypeError, 'Unsupported type.*BinaryType'): +self.spark.createDataFrame( +pd.DataFrame([[{'a': b'aaa'}]]), "a: binary") --- End diff -- @BryanCutler, BTW, what kind of data can be binary? `bytearray` and `byte` (`str` in Python 2)? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22119: [WIP][SPARK-25129][SQL] Revert mapping com.databr...
GitHub user gengliangwang opened a pull request: https://github.com/apache/spark/pull/22119 [WIP][SPARK-25129][SQL] Revert mapping com.databricks.spark.avro to org.apache.spark.sql.avro ## What changes were proposed in this pull request? In https://issues.apache.org/jira/browse/SPARK-24924, the data source provider com.databricks.spark.avro is mapped to the new package org.apache.spark.sql.avro . Avro is external module and not loaded by default, we should not prevent users from using "com.databricks.spark.avro". ## How was this patch tested? Unit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/gengliangwang/spark revert_avro_mapping Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22119.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22119 commit 4584b2db0e39a4a67685110e35a243757a858319 Author: Gengliang Wang Date: 2018-08-16T04:57:04Z Revert "[SPARK-24924][SQL] Add mapping for built-in Avro data source" This reverts commit 58353d7f4baa8102c3d2f4777a5c407f14993306. commit 656790e7cc05ea1fd31e768ed1a05c18b7881de6 Author: Gengliang Wang Date: 2018-08-16T05:42:19Z improve error message --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22117 **[Test build #94840 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94840/testReport)** for PR 22117 at commit [`3cad78f`](https://github.com/apache/spark/commit/3cad78f8bb9bc0dc841cd0c31e0b0d52f8e7c764). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488842 --- Diff: R/pkg/R/DataFrame.R --- @@ -2848,6 +2848,35 @@ setMethod("intersect", dataFrame(intersected) }) +#' intersectAll +#' +#' Return a new SparkDataFrame containing rows in both this SparkDataFrame +#' and another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the intersect all operation. +#' @family SparkDataFrame functions +#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method +#' @rdname intersectAll +#' @name intersectAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' intersectAllDF <- intersectAll(df1, df2) +#' } +#' @rdname intersectAll +#' @note intersectAll since 2.4.0 +setMethod("intersectAll", + signature(x = "SparkDataFrame", y = "SparkDataFrame"), + function(x, y) { +intersected <- callJMethod(x@sdf, "intersectAll", y@sdf) +dataFrame(intersected) + }) --- End diff -- add extra empty line after code --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488890 --- Diff: R/pkg/R/DataFrame.R --- @@ -2876,6 +2905,37 @@ setMethod("except", dataFrame(excepted) }) +#' exceptAll +#' +#' Return a new SparkDataFrame containing rows in this SparkDataFrame +#' but not in another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the except all operation. +#' @family SparkDataFrame functions +#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method +#' @rdname exceptAll +#' @name exceptAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' exceptAllDF <- exceptAll(df1, df2) +#' } +#' @rdname exceptAll +#' @note exceptAll since 2.4.0 +setMethod("exceptAll", + signature(x = "SparkDataFrame", y = "SparkDataFrame"), + function(x, y) { +excepted <- callJMethod(x@sdf, "exceptAll", y@sdf) +dataFrame(excepted) + }) + --- End diff -- nit: remove one of the two empty lines --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22117 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22117 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2238/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488754 --- Diff: R/pkg/R/DataFrame.R --- @@ -2848,6 +2848,35 @@ setMethod("intersect", dataFrame(intersected) }) +#' intersectAll +#' +#' Return a new SparkDataFrame containing rows in both this SparkDataFrame +#' and another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{INTERSECT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the intersect all operation. +#' @family SparkDataFrame functions +#' @aliases intersectAll,SparkDataFrame,SparkDataFrame-method +#' @rdname intersectAll +#' @name intersectAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' intersectAllDF <- intersectAll(df1, df2) +#' } +#' @rdname intersectAll --- End diff -- ditto here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22107: [SPARK-25117][R] Add EXEPT ALL and INTERSECT ALL ...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/22107#discussion_r210488641 --- Diff: R/pkg/R/DataFrame.R --- @@ -2876,6 +2905,37 @@ setMethod("except", dataFrame(excepted) }) +#' exceptAll +#' +#' Return a new SparkDataFrame containing rows in this SparkDataFrame +#' but not in another SparkDataFrame while preserving the duplicates. +#' This is equivalent to \code{EXCEPT ALL} in SQL. Also as standard in +#' SQL, this function resolves columns by position (not by name). +#' +#' @param x a SparkDataFrame. +#' @param y a SparkDataFrame. +#' @return A SparkDataFrame containing the result of the except all operation. +#' @family SparkDataFrame functions +#' @aliases exceptAll,SparkDataFrame,SparkDataFrame-method +#' @rdname exceptAll +#' @name exceptAll +#' @examples +#'\dontrun{ +#' sparkR.session() +#' df1 <- read.json(path) +#' df2 <- read.json(path2) +#' exceptAllDF <- exceptAll(df1, df2) +#' } +#' @rdname exceptAll --- End diff -- this is a bug in `except` there should only be one `@rdname` for each --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/22117 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 You are perfectly correct @jiangxb1987, that was a silly mistake on my part - and not trivial at all ! It should be shuffle dependency we should rely on when traversing the dependency tree, not shuffledrdd. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21320 @mallman, can you close this and put some efforts there in https://github.com/apache/spark/pull/21889? I see no point of leaving this PR open. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/22112 > IMO we should traverse the dependency graph and rely on how ShuffledRDD is configured A trivial point here - Since `ShuffleDependency` is also a DeveloperAPI, it's possible for users to write a customized RDD that behaves like `ShuffleRDD`, so we may want to depend on dependencies rather than RDDs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22115: [SPARK-25082] [SQL] improve the javadoc for expm1...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22115 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/22112 I am not sure what the definition of `isIdempotent` here is. For example, from MapPartitionsRDD : ``` override private[spark] def isIdempotent = { if (inputOrderSensitive) { prev.isIdempotent } else { true } } ``` Consider: `val rdd1 = rdd.groupBy().map(...).repartition(...).filter(...)`. By definition above, this would make rdd1 idempotent. Depending on what the definition of idempotent is (partition level, record level, etc) - this can be correct or wrong code. Similarly, I am not sure why idempotency or ordering is depending on `Partitioner`. IMO we should traverse the dependency graph and rely on how `ShuffledRDD` is configured - whether there is a key ordering specified (applies to both global sort and per partition sort), whether it is from a checkpoint or marked for checkpoint, whether it is from a stable input source, etc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22115 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22118: Branch 2.2
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22118 @speful looks mistakenly open. mind closing this please? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22095: [SPARK-23984][K8S] Changed Python Version config to be c...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/22095 @mccheah btw, please add a comment (say "merged to master") after you merge a PR - just a convention in this project. FYI. thx. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22031 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2237/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22031 **[Test build #94839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94839/testReport)** for PR 22031 at commit [`16516ec`](https://github.com/apache/spark/commit/16516ec0862dab60cfbc88683233ed08f6fa5ca5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22031 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22009 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94837/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22009 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22009: [SPARK-24882][SQL] improve data source v2 API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22009 **[Test build #94837 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94837/testReport)** for PR 22009 at commit [`0318b4b`](https://github.com/apache/spark/commit/0318b4b1dcbfde0024945308578cedf8d4a09168). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/21860 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22118: Branch 2.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22118 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22118: Branch 2.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22118 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22118: Branch 2.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22118 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94838/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22118: Branch 2.2
GitHub user speful opened a pull request: https://github.com/apache/spark/pull/22118 Branch 2.2 ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/spark branch-2.2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22118.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22118 commit 86609a95af4b700e83638b7416c7e3706c2d64c6 Author: Liang-Chi Hsieh Date: 2017-08-08T08:12:41Z [SPARK-21567][SQL] Dataset should work with type alias If we create a type alias for a type workable with Dataset, the type alias doesn't work with Dataset. A reproducible case looks like: object C { type TwoInt = (Int, Int) def tupleTypeAlias: TwoInt = (1, 1) } Seq(1).toDS().map(_ => ("", C.tupleTypeAlias)) It throws an exception like: type T1 is not a class scala.ScalaReflectionException: type T1 is not a class at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) ... This patch accesses the dealias of type in many places in `ScalaReflection` to fix it. Added test case. Author: Liang-Chi Hsieh Closes #18813 from viirya/SPARK-21567. (cherry picked from commit ee1304199bcd9c1d5fc94f5b06fdd5f6fe7336a1) Signed-off-by: Wenchen Fan commit e87ffcaa3e5b75f8d313dc995e4801063b60cd5c Author: Wenchen Fan Date: 2017-08-08T08:32:49Z Revert "[SPARK-21567][SQL] Dataset should work with type alias" This reverts commit 86609a95af4b700e83638b7416c7e3706c2d64c6. commit d0233145208eb6afcd9fe0c1c3a9dbbd35d7727e Author: pgandhi Date: 2017-08-09T05:46:06Z [SPARK-21503][UI] Spark UI shows incorrect task status for a killed Executor Process The executor tab on Spark UI page shows task as completed when an executor process that is running that task is killed using the kill command. Added the case ExecutorLostFailure which was previously not there, thus, the default case would be executed in which case, task would be marked as completed. This case will consider all those cases where executor connection to Spark Driver was lost due to killing the executor process, network connection etc. ## How was this patch tested? Manually Tested the fix by observing the UI change before and after. Before: https://user-images.githubusercontent.com/8190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png";> After: https://user-images.githubusercontent.com/8190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png";> Please review http://spark.apache.org/contributing.html before opening a pull request. Author: pgandhi Author: pgandhi999 Closes #18707 from pgandhi999/master. (cherry picked from commit f016f5c8f6c6aae674e9905a5c0b0bede09163a4) Signed-off-by: Wenchen Fan commit 7446be3328ea75a5197b2587e3a8e2ca7977726b Author: WeichenXu Date: 2017-08-09T06:44:10Z [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search ## What changes were proposed in this pull request? Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search https://github.com/scalanlp/breeze/pull/651 ## How was this patch tested? N/A Author: WeichenXu Closes #18797 from WeichenXu123/update-breeze. (cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d) Signed-off-by: Yanbo Liang commit f6d56d2f1c377000921effea2b1faae15f9cae82 Author: Shixiong Zhu Date: 2017-08-09T06:49:33Z [SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the return value Same PR as #18799 but for branch 2.2. Main discussion the other PR. When I was investigating a flaky test, I realized that many places don't check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When a batch is supposed to be there, the caller just ignores None rather than throwing an error. If some bug causes a query doesn't generate a batch metadata file, this behavior will hide it and allow the query continuing to run and finally delete metadata logs and make it hard to debug. This PR ensures that places calling HDFSMetadataLog.get always check the
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20725 **[Test build #94838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94838/testReport)** for PR 20725 at commit [`461c326`](https://github.com/apache/spark/commit/461c326f00d68a350a1b5c0f7b644f2871ee0a85). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22117 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94834/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22117 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22117: [SPARK-23654][BUILD] remove jets3t as a dependency of sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22117 **[Test build #94834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94834/testReport)** for PR 22117 at commit [`3cad78f`](https://github.com/apache/spark/commit/3cad78f8bb9bc0dc841cd0c31e0b0d52f8e7c764). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20725 This is working now, and BinaryType support is conditional on pyarrow 0.10.0 or higher being used. @HyukjinKwon @cloud-fan what are your thoughts on getting this in for Spark 2.4? I think it would be very useful to have since images in Spark use the BInaryType and it will be good to have when integrating Spark with DL frameworks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20725 **[Test build #94838 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94838/testReport)** for PR 20725 at commit [`461c326`](https://github.com/apache/spark/commit/461c326f00d68a350a1b5c0f7b644f2871ee0a85). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21537: [SPARK-24505][SQL] Convert strings in codegen to blocks:...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21537 We are fully swamped by the hotfix and regressions of 2.3 release and the new features that are targeting to 2.4. We should post some comments in this PR earlier. Designing an IR for our codegen is the right thing we should do. [If you do not agree on this, we can discuss about it.] How to design an IR is a challenging task. The whole community is welcome to submit the designs and PRs. Everyone can show the ideas. The best idea will win. @HyukjinKwon If you have a bandwidth, please also have a try --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20725 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2236/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21860 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21860 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94835/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21860 **[Test build #94835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94835/testReport)** for PR 21860 at commit [`6ff46d9`](https://github.com/apache/spark/commit/6ff46d941a6ddb29345ea0c563aa68b77f540139). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21912: [SPARK-24962][SQL] Refactor CodeGenerator.createUnsafeAr...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21912 cc @ueshin --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22031 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94833/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22031 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22031 **[Test build #94833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94833/testReport)** for PR 22031 at commit [`0342ed9`](https://github.com/apache/spark/commit/0342ed934e65c13c43081f464503800118383a44). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files
Github user raofu commented on a diff in the pull request: https://github.com/apache/spark/pull/22113#discussion_r210473687 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala --- @@ -70,7 +70,7 @@ private[hive] object OrcFileOperator extends Logging { hdfsPath.getFileSystem(conf) } -listOrcFiles(basePath, conf).iterator.map { path => +listOrcFiles(basePath, conf).view.map { path => --- End diff -- My bad. I misread the code. Sorry about the noise. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22113: [SPARK-25126] Lazily create Reader for orc files
Github user raofu closed the pull request at: https://github.com/apache/spark/pull/22113 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22115 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94831/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22115 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22115 **[Test build #94831 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94831/testReport)** for PR 22115 at commit [`089c31f`](https://github.com/apache/spark/commit/089c31fcff1a5b84634f5de78c1bd440f738b2f4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22045#discussion_r210469510 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala --- @@ -497,6 +497,53 @@ case class ArrayAggregate( override def prettyName: String = "aggregate" } +/** + * Returns a map that applies the function to each value of the map. + */ +@ExpressionDescription( +usage = "_FUNC_(expr, func) - Transforms values in the map using the function.", +examples = """ +Examples: + > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> v + 1); +map(array(1, 2, 3), array(2, 3, 4)) + > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v); +map(array(1, 2, 3), array(2, 4, 6)) + """, +since = "2.4.0") --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22045#discussion_r210469494 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala --- @@ -497,6 +497,53 @@ case class ArrayAggregate( override def prettyName: String = "aggregate" } +/** + * Returns a map that applies the function to each value of the map. + */ +@ExpressionDescription( +usage = "_FUNC_(expr, func) - Transforms values in the map using the function.", +examples = """ --- End diff -- ditto. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22045#discussion_r210471011 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -2302,6 +2302,177 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { assert(ex5.getMessage.contains("function map_zip_with does not support ordering on type map")) } + test("transform values function - test primitive data types") { +val dfExample1 = Seq( + Map[Int, Int](1 -> 1, 9 -> 9, 8 -> 8, 7 -> 7) +).toDF("i") + +val dfExample2 = Seq( + Map[Boolean, String](false -> "abc", true -> "def") +).toDF("x") + +val dfExample3 = Seq( + Map[String, Int]("a" -> 1, "b" -> 2, "c" -> 3) +).toDF("y") + +val dfExample4 = Seq( + Map[Int, Double](1 -> 1.0, 2 -> 1.40, 3 -> 1.70) +).toDF("z") + +val dfExample5 = Seq( + Map[Int, Array[Int]](1 -> Array(1, 2)) +).toDF("c") + +def testMapOfPrimitiveTypesCombination(): Unit = { + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> k + v)"), +Seq(Row(Map(1 -> 2, 9 -> 18, 8 -> 16, 7 -> 14 + + checkAnswer(dfExample2.selectExpr( +"transform_values(x, (k, v) -> if(k, v, CAST(k AS String)))"), +Seq(Row(Map(false -> "false", true -> "def" + + checkAnswer(dfExample2.selectExpr("transform_values(x, (k, v) -> NOT k AND v = 'abc')"), +Seq(Row(Map(false -> true, true -> false + + checkAnswer(dfExample3.selectExpr("transform_values(y, (k, v) -> v * v)"), +Seq(Row(Map("a" -> 1, "b" -> 4, "c" -> 9 + + checkAnswer(dfExample3.selectExpr( +"transform_values(y, (k, v) -> k || ':' || CAST(v as String))"), +Seq(Row(Map("a" -> "a:1", "b" -> "b:2", "c" -> "c:3" + + checkAnswer( +dfExample3.selectExpr("transform_values(y, (k, v) -> concat(k, cast(v as String)))"), +Seq(Row(Map("a" -> "a1", "b" -> "b2", "c" -> "c3" + + checkAnswer( +dfExample4.selectExpr( + "transform_values(" + +"z,(k, v) -> map_from_arrays(ARRAY(1, 2, 3), " + +"ARRAY('one', 'two', 'three'))[k] || '_' || CAST(v AS String))"), +Seq(Row(Map(1 -> "one_1.0", 2 -> "two_1.4", 3 ->"three_1.7" + + checkAnswer( +dfExample4.selectExpr("transform_values(z, (k, v) -> k-v)"), +Seq(Row(Map(1 -> 0.0, 2 -> 0.6001, 3 -> 1.3 + + checkAnswer( +dfExample5.selectExpr("transform_values(c, (k, v) -> k + cardinality(v))"), +Seq(Row(Map(1 -> 3 +} + +// Test with local relation, the Project will be evaluated without codegen +testMapOfPrimitiveTypesCombination() +dfExample1.cache() +dfExample2.cache() +dfExample3.cache() +dfExample4.cache() +dfExample5.cache() +// Test with cached relation, the Project will be evaluated with codegen +testMapOfPrimitiveTypesCombination() + } + + test("transform values function - test empty") { +val dfExample1 = Seq( + Map.empty[Integer, Integer] +).toDF("i") + +val dfExample2 = Seq( + Map.empty[BigInt, String] +).toDF("j") + +def testEmpty(): Unit = { + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> NULL)"), +Seq(Row(Map.empty[Integer, Integer]))) + + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> k)"), +Seq(Row(Map.empty[Integer, Integer]))) + + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> v)"), +Seq(Row(Map.empty[Integer, Integer]))) + + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 0)"), +Seq(Row(Map.empty[Integer, Integer]))) + + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> 'value')"), +Seq(Row(Map.empty[Integer, String]))) + + checkAnswer(dfExample1.selectExpr("transform_values(i, (k, v) -> true)"), +Seq(Row(Map.empty[Integer, Boolean]))) + + checkAnswer(dfExample2.selectExpr("transform_values(j, (k, v) -> k + cast(v as BIGINT))"), +Seq(Row(Map.empty[BigInt, BigInt]))) +} + +testEmpty() +dfExample1.cache() +dfExample2.cache() +testEmpty() + } + + test("transform values function - test null values") { +val dfExample1 = Seq( + Map[Int, Integer](1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4) +).toDF("a") + +val dfExample2 = Seq( + Map[Int, String](1 -> "a", 2 -> "b", 3 -> null)
[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22045#discussion_r210469472 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala --- @@ -497,6 +497,53 @@ case class ArrayAggregate( override def prettyName: String = "aggregate" } +/** + * Returns a map that applies the function to each value of the map. + */ +@ExpressionDescription( +usage = "_FUNC_(expr, func) - Transforms values in the map using the function.", --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22045: [SPARK-23940][SQL] Add transform_values SQL funct...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22045#discussion_r210470513 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala --- @@ -497,6 +497,53 @@ case class ArrayAggregate( override def prettyName: String = "aggregate" } +/** + * Returns a map that applies the function to each value of the map. + */ +@ExpressionDescription( +usage = "_FUNC_(expr, func) - Transforms values in the map using the function.", +examples = """ +Examples: + > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> v + 1); +map(array(1, 2, 3), array(2, 3, 4)) + > SELECT _FUNC_(map(array(1, 2, 3), array(1, 2, 3)), (k, v) -> k + v); +map(array(1, 2, 3), array(2, 4, 6)) + """, +since = "2.4.0") +case class TransformValues( +argument: Expression, +function: Expression) + extends MapBasedSimpleHigherOrderFunction with CodegenFallback { + + override def nullable: Boolean = argument.nullable + + @transient lazy val MapType(keyType, valueType, valueContainsNull) = argument.dataType + + override def dataType: DataType = MapType(keyType, function.dataType, valueContainsNull) + + override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction) + : TransformValues = { +copy(function = f(function, (keyType, false) :: (valueType, valueContainsNull) :: Nil)) + } + + @transient lazy val LambdaFunction( + _, (keyVar: NamedLambdaVariable) :: (valueVar: NamedLambdaVariable) :: Nil, _) = function --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22109: [SPARK-25120][CORE][HistoryServer]Fix the problem of Eve...
Github user deshanxiao commented on the issue: https://github.com/apache/spark/pull/22109 @vanzin Sorry..SPARK-22850 has fix the problem. Maybe I will track the executor lose problem next. Thank you! @vanzin @squito --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210467354 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala --- @@ -442,3 +442,91 @@ case class ArrayAggregate( override def prettyName: String = "aggregate" } + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(left, right, func) - Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.", + examples = """ +Examples: + > SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x)); + array(('a', 1), ('b', 3), ('c', 5)) + > SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y)); + array(4, 6) + > SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y)); + array('ad', 'be', 'cf') + """, + since = "2.4.0") +// scalastyle:on line.size.limit +case class ArraysZipWith( +left: Expression, +right: Expression, +function: Expression) + extends HigherOrderFunction with CodegenFallback with ExpectsInputTypes { + + override def inputs: Seq[Expression] = List(left, right) + + override def functions: Seq[Expression] = List(function) + + def expectingFunctionType: AbstractDataType = AnyDataType + @transient lazy val functionForEval: Expression = functionsForEval.head + + override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType, ArrayType, expectingFunctionType) + + override def nullable: Boolean = inputs.exists(_.nullable) + + override def dataType: ArrayType = ArrayType(function.dataType, function.nullable) + + override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArraysZipWith = { +val (leftElementType, leftContainsNull) = left.dataType match { + case ArrayType(elementType, containsNull) => (elementType, containsNull) + case _ => +val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType +(elementType, containsNull) +} +val (rightElementType, rightContainsNull) = right.dataType match { + case ArrayType(elementType, containsNull) => (elementType, containsNull) + case _ => +val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType +(elementType, containsNull) +} +copy(function = f(function, + (leftElementType, leftContainsNull) :: (rightElementType, rightContainsNull) :: Nil)) --- End diff -- If we append `null`s to the shorter array, both of the arguments might be `null`, so we should use `true` for nullabilities of the arguments as @mn-mikke suggested. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210467721 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala --- @@ -396,4 +396,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper map_zip_with(mbb0, mbbn, concat), null) } + + test("ZipWith") { +def zip_with( +left: Expression, +right: Expression, +f: (Expression, Expression) => Expression): Expression = { + val ArrayType(leftT, leftContainsNull) = left.dataType.asInstanceOf[ArrayType] + val ArrayType(rightT, rightContainsNull) = right.dataType.asInstanceOf[ArrayType] + ZipWith(left, right, createLambda(leftT, leftContainsNull, rightT, rightContainsNull, f)) +} + +val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false)) +val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false)) --- End diff -- What's the difference between `ai0` and `ai1`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210467959 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala --- @@ -396,4 +396,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper map_zip_with(mbb0, mbbn, concat), null) } + + test("ZipWith") { +def zip_with( +left: Expression, +right: Expression, +f: (Expression, Expression) => Expression): Expression = { + val ArrayType(leftT, leftContainsNull) = left.dataType.asInstanceOf[ArrayType] + val ArrayType(rightT, rightContainsNull) = right.dataType.asInstanceOf[ArrayType] + ZipWith(left, right, createLambda(leftT, leftContainsNull, rightT, rightContainsNull, f)) +} + +val ai0 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false)) +val ai1 = Literal.create(Seq(1, 2, 3), ArrayType(IntegerType, containsNull = false)) +val ai2 = Literal.create(Seq[Integer](1, null, 3), ArrayType(IntegerType, containsNull = true)) +val ai3 = Literal.create(Seq[Integer](1, null), ArrayType(IntegerType, containsNull = true)) +val ain = Literal.create(null, ArrayType(IntegerType, containsNull = false)) + +val add: (Expression, Expression) => Expression = (x, y) => x + y +val plusOne: Expression => Expression = x => x + 1 + +checkEvaluation(zip_with(ai0, ai1, add), Seq(2, 4, 6)) +checkEvaluation(zip_with(ai3, ai2, add), Seq(2, null, null)) +checkEvaluation(zip_with(ai2, ai3, add), Seq(2, null, null)) +checkEvaluation(zip_with(ain, ain, add), null) +checkEvaluation(zip_with(ai1, ain, add), null) +checkEvaluation(zip_with(ain, ai1, add), null) + +val as0 = Literal.create(Seq("a", "b", "c"), ArrayType(StringType, containsNull = false)) +val as1 = Literal.create(Seq("a", null, "c"), ArrayType(StringType, containsNull = true)) +val as2 = Literal.create(Seq("a"), ArrayType(StringType, containsNull = true)) +val asn = Literal.create(null, ArrayType(StringType, containsNull = false)) + +val concat: (Expression, Expression) => Expression = (x, y) => Concat(Seq(x, y)) + +checkEvaluation(zip_with(as0, as1, concat), Seq("aa", null, "cc")) +checkEvaluation(zip_with(as0, as2, concat), Seq("aa", null, null)) + +val aai1 = Literal.create(Seq(Seq(1, 2, 3), null, Seq(4, 5)), + ArrayType(ArrayType(IntegerType, containsNull = false), containsNull = true)) +val aai2 = Literal.create(Seq(Seq(1, 2, 3)), + ArrayType(ArrayType(IntegerType, containsNull = false), containsNull = true)) +checkEvaluation( + zip_with(aai1, aai2, (a1, a2) => + Cast(zip_with(transform(a1, plusOne), transform(a2, plusOne), add), StringType)), --- End diff -- nit: indent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210468814 --- Diff: sql/core/src/test/resources/sql-tests/inputs/higher-order-functions.sql --- @@ -51,3 +51,12 @@ select exists(ys, y -> y > 30) as v from nested; -- Check for element existence in a null array select exists(cast(null as array), y -> y > 30) as v; + +-- Zip with array +select zip_with(ys, zs, (a, b) -> a + size(b)) as v from nested; + +-- Zip with array with concat +select zip_with(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y)) as v; + +-- Zip with array coalesce +select zip_with(array('a'), array('d', null, 'f'), (x, y) -> coalesce(x, y)) as v; --- End diff -- Can you add a line break at the end of the file? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210468640 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -2302,6 +2302,76 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSQLContext { assert(ex5.getMessage.contains("function map_zip_with does not support ordering on type map")) } + test("arrays zip_with function - for primitive types") { +val df1 = Seq[(Seq[Integer], Seq[Integer])]( + (Seq(9001, 9002, 9003), Seq(4, 5, 6)), + (Seq(1, 2), Seq(3, 4)), + (Seq.empty, Seq.empty), + (null, null) +).toDF("val1", "val2") +val df2 = Seq[(Seq[Integer], Seq[Long])]( + (Seq(1, null, 3), Seq(1L, 2L)), + (Seq(1, 2, 3), Seq(4L, 11L)) +).toDF("val1", "val2") + +val expectedValue1 = Seq( + Row(Seq(9005, 9007, 9009)), + Row(Seq(4, 6)), + Row(Seq.empty), + Row(null)) +checkAnswer(df1.selectExpr("zip_with(val1, val2, (x, y) -> x + y)"), expectedValue1) + +val expectedValue2 = Seq( + Row(Seq(Row(1.0, 1), Row(2.0, null), Row(null, 3))), --- End diff -- Why `1.0` or `2.0` instead of `1L` or `2L`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210466854 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala --- @@ -687,3 +687,89 @@ case class MapZipWith(left: Expression, right: Expression, function: Expression) override def prettyName: String = "map_zip_with" } + +// scalastyle:off line.size.limit +@ExpressionDescription( + usage = "_FUNC_(left, right, func) - Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function.", + examples = """ +Examples: + > SELECT _FUNC_(array(1, 2, 3), array('a', 'b', 'c'), (x, y) -> (y, x)); + array(('a', 1), ('b', 3), ('c', 5)) + > SELECT _FUNC_(array(1, 2), array(3, 4), (x, y) -> x + y)); + array(4, 6) + > SELECT _FUNC_(array('a', 'b', 'c'), array('d', 'e', 'f'), (x, y) -> concat(x, y)); + array('ad', 'be', 'cf') + """, + since = "2.4.0") +// scalastyle:on line.size.limit +case class ZipWith(left: Expression, right: Expression, function: Expression) + extends HigherOrderFunction with CodegenFallback { + + def functionForEval: Expression = functionsForEval.head + + override def arguments: Seq[Expression] = left :: right :: Nil + + override def argumentTypes: Seq[AbstractDataType] = ArrayType :: ArrayType :: Nil + + override def functions: Seq[Expression] = List(function) + + override def functionTypes: Seq[AbstractDataType] = AnyDataType :: Nil + + override def nullable: Boolean = left.nullable || right.nullable + + override def dataType: ArrayType = ArrayType(function.dataType, function.nullable) + + override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ZipWith = { +val (leftElementType, leftContainsNull) = left.dataType match { + case ArrayType(elementType, containsNull) => (elementType, containsNull) + case _ => +val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType +(elementType, containsNull) +} +val (rightElementType, rightContainsNull) = right.dataType match { + case ArrayType(elementType, containsNull) => (elementType, containsNull) + case _ => +val ArrayType(elementType, containsNull) = ArrayType.defaultConcreteType +(elementType, containsNull) +} --- End diff -- Now we can do: ```scala val ArrayType(leftElementType, leftContainsNull) = left.dataType val ArrayType(rightElementType, rightContainsNull) = right.dataType ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22031: [SPARK-23932][SQL] Higher order function zip_with
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22031#discussion_r210467535 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala --- @@ -396,4 +396,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper map_zip_with(mbb0, mbbn, concat), null) } + + test("ZipWith") { +def zip_with( +left: Expression, +right: Expression, +f: (Expression, Expression) => Expression): Expression = { + val ArrayType(leftT, leftContainsNull) = left.dataType.asInstanceOf[ArrayType] + val ArrayType(rightT, rightContainsNull) = right.dataType.asInstanceOf[ArrayType] --- End diff -- nit: we don't need `.asInstanceOf[ArrayType]`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210468884 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -246,6 +245,16 @@ class BisectingKMeans private ( new BisectingKMeansModel(root, this.distanceMeasure) } + /** + * Runs the bisecting k-means algorithm. + * @param input RDD of vectors + * @return model for the bisecting kmeans + */ + @Since("1.6.0") --- End diff -- Oh right I get it now, this isn't a new method, it's 'replacing' the definition above. ð --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22116: [DOCS]Update configuration.md
Github user srowen commented on the issue: https://github.com/apache/spark/pull/22116 Huh OK I thought I looked and this had been fixed. Good catch. Also there's an instance in `cloud-integration.md`, worth fixing too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210468639 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -246,6 +245,16 @@ class BisectingKMeans private ( new BisectingKMeansModel(root, this.distanceMeasure) } + /** + * Runs the bisecting k-means algorithm. + * @param input RDD of vectors + * @return model for the bisecting kmeans + */ + @Since("1.6.0") --- End diff -- `def run(input: RDD[Vector]): BisectingKMeansModel` is a public api since 1.6, and users can call it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210468107 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -246,6 +245,16 @@ class BisectingKMeans private ( new BisectingKMeansModel(root, this.distanceMeasure) } + /** + * Runs the bisecting k-means algorithm. + * @param input RDD of vectors + * @return model for the bisecting kmeans + */ + @Since("1.6.0") --- End diff -- You couldn't call `BisectingKMeans.run(...)` before this, right? it wasn't in a superclass or anything. In that sense I think this method needs to be marked as new as of 2.4.0, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21469 @tdas Kindly reminder. @zsxwing Could you take a quick look at this and share your thought? I think the patch is ready to merge, but blocked with slightly conflict of view so more voices would be better. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21561: [SPARK-24555][ML] logNumExamples in KMeans/BiKM/G...
Github user zhengruifeng commented on a diff in the pull request: https://github.com/apache/spark/pull/21561#discussion_r210467653 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -246,6 +245,16 @@ class BisectingKMeans private ( new BisectingKMeansModel(root, this.distanceMeasure) } + /** + * Runs the bisecting k-means algorithm. + * @param input RDD of vectors + * @return model for the bisecting kmeans + */ + @Since("1.6.0") --- End diff -- this api was already existing since 1.6.0, so we should keep the since annotation? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21733: [SPARK-24763][SS] Remove redundant key data from value i...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/21733 @tdas Kindly reminder. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22115: [SPARK-25082] [SQL] improve the javadoc for expm1()
Github user bomeng commented on the issue: https://github.com/apache/spark/pull/22115 I have already done the global search. That is the only place needs change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org