[spark] branch branch-3.0 updated: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new f3b80f8  [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
f3b80f8 is described below

commit f3b80f88324e8a1a76d01d13cfc1fc7082238214
Author: Dongjoon Hyun
AuthorDate: Tue Sep 29 12:02:45 2020 -0700

    [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

    ### What changes were proposed in this pull request?

    Apache Spark 3.1's default Hadoop profile is `hadoop-3.2`. Instead of relying on a documentation warning, this PR makes Spark use the consistent and safer version of the Apache Hadoop file output committer algorithm, `v1`, by default. This prevents a silent correctness regression when migrating from Apache Spark 2.4/3.0 to Apache Spark 3.1.0. If a user explicitly sets `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2`, that configuration is still honored.

    ### Why are the changes needed?

    Apache Spark publishes distributions built against both Hadoop 2.7 and Hadoop 3.2, and the default of `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version` depends on the Hadoop version: Apache Hadoop 3.0 switched the default algorithm from `v1` to `v2`, and there is now a discussion about removing `v2`. Spark should provide a consistent default behavior of `v1` across its distributions.

    - [MAPREDUCE-7282](https://issues.apache.org/jira/browse/MAPREDUCE-7282) MR v2 commit algorithm should be deprecated and not the default

    ### Does this PR introduce _any_ user-facing change?

    Yes. This changes the default behavior. Users can override this conf.

    ### How was this patch tested?

    Manual.

    **BEFORE (spark-3.0.1-bin-hadoop3.2)**
    ```scala
    scala> sc.version
    res0: String = 3.0.1

    scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
    res1: String = 2
    ```

    **AFTER**
    ```scala
    scala> sc.hadoopConfiguration.get("mapreduce.fileoutputcommitter.algorithm.version")
    res0: String = 1
    ```

    Closes #29895 from dongjoon-hyun/SPARK-DEFAUT-COMMITTER.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
(cherry picked from commit cc06266ade5a4eb35089501a3b32736624208d4c)
Signed-off-by: Dongjoon Hyun
---
 .../main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala |  3 +++
 docs/configuration.md                                        | 10 ++--------
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
index 1180501..6f799a5 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
@@ -462,6 +462,9 @@ private[spark] object SparkHadoopUtil {
     for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
+    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
+      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")
+    }
   }

   private def appendSparkHiveConfigs(conf: SparkConf, hadoopConf: Configuration): Unit = {
diff --git a/docs/configuration.md b/docs/configuration.md
index 95ff282..36e4f45 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1761,16 +1761,10 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version</code></td>
-  <td>Dependent on environment</td>
+  <td>1</td>
   <td>
     The file output committer algorithm version, valid algorithm version number: 1 or 2.
-    Version 2 may have better performance, but version 1 may handle failures better in certain situations,
-    as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
-    The default value depends on the Hadoop version used in an environment:
-    1 for Hadoop versions lower than 3.0
-    2 for Hadoop versions 3.0 and higher
-    It's important to note that this can change back to 1 again in the future once <a href="https://issues.apache.org/jira/browse/MAPREDUCE-7282">MAPREDUCE-7282</a>
-    is fixed and merged.
+    Note that 2 may cause a correctness issue like <a href="https://issues.apache.org/jira/browse/MAPREDUCE-7282">MAPREDUCE-7282</a>.
   </td>
   <td>2.2.0</td>
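For readers upgrading, the override path mentioned above works through the normal `spark.hadoop.*` prefix. A minimal sketch of opting back into `v2`, assuming a standard SparkSession setup (the application name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Explicitly opt back into the v2 committer algorithm, overriding the new
// v1 default. Any "spark.hadoop.*" key is copied into the Hadoop
// Configuration with the "spark.hadoop." prefix stripped.
val spark = SparkSession.builder()
  .appName("committer-v2-example") // illustrative name
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// The user-provided value wins over the v1 default:
println(spark.sparkContext.hadoopConfiguration
  .get("mapreduce.fileoutputcommitter.algorithm.version")) // prints 2
```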
[spark] branch master updated (711d8dd -> cc06266)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 711d8dd  [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes
 add cc06266  [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala |  3 +++
 docs/configuration.md                                        | 10 ++--------
 2 files changed, 5 insertions(+), 8 deletions(-)
[spark] branch branch-3.0 updated (39bfae2 -> ae8b35a)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 39bfae2  [MINOR][DOCS] Fixing log message for better clarity
 add ae8b35a  [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes

No new revisions were added by this update.

Summary of changes:
 .../SizeInBytesOnlyStatsPlanVisitor.scala          |  3 ++-
 .../statsEstimation/JoinEstimationSuite.scala      | 22 ++++++++++++++++++++++
 .../statsEstimation/StatsEstimationTestBase.scala  |  9 ++++++---
 3 files changed, 30 insertions(+), 4 deletions(-)
[spark] branch master updated (7766fd1 -> 711d8dd)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 7766fd1  [MINOR][DOCS] Fixing log message for better clarity
 add 711d8dd  [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes

No new revisions were added by this update.

Summary of changes:
 .../SizeInBytesOnlyStatsPlanVisitor.scala          |  3 ++-
 .../statsEstimation/JoinEstimationSuite.scala      | 22 ++++++++++++++++++++++
 .../statsEstimation/StatsEstimationTestBase.scala  |  9 ++++++---
 3 files changed, 30 insertions(+), 4 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new ae8b35a  [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes
ae8b35a is described below

commit ae8b35a0d24f8c83597d668875793a8dbca6
Author: Yuming Wang
AuthorDate: Tue Sep 29 16:46:04 2020 +0000

    [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes

    ### What changes were proposed in this pull request?
    This PR fixes an estimate statistics issue when a child has 0 bytes.

    ### Why are the changes needed?
    The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This can produce an incorrect BroadcastJoin, resulting in a driver OOM. For example:
    ![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)

    ### Does this PR introduce _any_ user-facing change?
    No.

    ### How was this patch tested?
    Manual test.

    Closes #29894 from wangyum/SPARK-33018.

    Authored-by: Yuming Wang
    Signed-off-by: Wenchen Fan
    (cherry picked from commit 711d8dd28afd9af92b025f9908534e5f1d575042)
    Signed-off-by: Wenchen Fan
---
 .../SizeInBytesOnlyStatsPlanVisitor.scala          |  3 ++-
 .../statsEstimation/JoinEstimationSuite.scala      | 22 ++++++++++++++++++++++
 .../statsEstimation/StatsEstimationTestBase.scala  |  9 ++++++---
 3 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala
index da36db7..a586988 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala
@@ -53,7 +53,8 @@ object SizeInBytesOnlyStatsPlanVisitor extends LogicalPlanVisitor[Statistics] {
    */
   override def default(p: LogicalPlan): Statistics = p match {
     case p: LeafNode => p.computeStats()
-    case _: LogicalPlan => Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).product)
+    case _: LogicalPlan =>
+      Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).filter(_ > 0L).product)
   }

   override def visitAggregate(p: Aggregate): Statistics = {
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala
index 6c5a2b2..cdfc863 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala
@@ -551,4 +551,26 @@ class JoinEstimationSuite extends StatsEstimationTestBase {
       attributeStats = AttributeMap(Nil))
     assert(join.stats == expectedStats)
   }
+
+  test("SPARK-33018 Fix estimate statistics issue if child has 0 bytes") {
+    case class MyStatsTestPlan(
+        outputList: Seq[Attribute],
+        sizeInBytes: BigInt) extends LeafNode {
+      override def output: Seq[Attribute] = outputList
+      override def computeStats(): Statistics = Statistics(sizeInBytes = sizeInBytes)
+    }
+
+    val left = MyStatsTestPlan(
+      outputList = Seq("key-1-2", "key-2-4").map(nameToAttr),
+      sizeInBytes = BigInt(100))
+
+    val right = MyStatsTestPlan(
+      outputList = Seq("key-1-2", "key-2-3").map(nameToAttr),
+      sizeInBytes = BigInt(0))
+
+    val join = Join(left, right, LeftOuter,
+      Some(EqualTo(nameToAttr("key-2-4"), nameToAttr("key-2-3"))), JoinHint.NONE)
+
+    assert(join.stats == Statistics(sizeInBytes = 100))
+  }
 }
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationTestBase.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationTestBase.scala
index 9dceca5..0a27e31 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationTestBase.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/StatsEstimationTestBase.scala
@@ -26,17 +26,20 @@ import org.apache.spark.sql.types.{IntegerType, StringType}

 trait StatsEstimationTestBase extends SparkFunSuite {

-  var originalValue: Boolean = false
+  var originalCBOValue: Boolean = false
+  var originalPlanStatsValue: Boolean = false

   override def
[spark] branch master updated (7766fd1 -> 711d8dd)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7766fd1 [MINOR][DOCS] Fixing log message for better clarity add 711d8dd [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes No new revisions were added by this update. Summary of changes: .../SizeInBytesOnlyStatsPlanVisitor.scala | 3 ++- .../statsEstimation/JoinEstimationSuite.scala | 22 ++ .../statsEstimation/StatsEstimationTestBase.scala | 9 ++--- 3 files changed, 30 insertions(+), 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [MINOR][DOCS] Fixing log message for better clarity
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 39bfae2  [MINOR][DOCS] Fixing log message for better clarity
39bfae2 is described below

commit 39bfae25979aecbe8058beb2a4882fde9f141eba
Author: Akshat Bordia
AuthorDate: Tue Sep 29 08:38:43 2020 -0500

    [MINOR][DOCS] Fixing log message for better clarity

    Fixing the log message for better clarity.

    Closes #29870 from akshatb1/master.

    Lead-authored-by: Akshat Bordia
    Co-authored-by: Akshat Bordia
    Signed-off-by: Sean Owen
    (cherry picked from commit 7766fd13c9e7cb72b97fdfee224d3958fbe882a0)
    Signed-off-by: Sean Owen
---
 core/src/main/scala/org/apache/spark/SparkConf.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala b/core/src/main/scala/org/apache/spark/SparkConf.scala
index 40915e3..802100e 100644
--- a/core/src/main/scala/org/apache/spark/SparkConf.scala
+++ b/core/src/main/scala/org/apache/spark/SparkConf.scala
@@ -577,7 +577,7 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging with Seria
     // If spark.executor.heartbeatInterval bigger than spark.network.timeout,
     // it will almost always cause ExecutorLostFailure. See SPARK-22754.
     require(executorTimeoutThresholdMs > executorHeartbeatIntervalMs, "The value of " +
-      s"${networkTimeout}=${executorTimeoutThresholdMs}ms must be no less than the value of " +
+      s"${networkTimeout}=${executorTimeoutThresholdMs}ms must be greater than the value of " +
       s"${EXECUTOR_HEARTBEAT_INTERVAL.key}=${executorHeartbeatIntervalMs}ms.")
   }
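The check guarded by this message compares two user-settable timeouts. A minimal sketch of a configuration that satisfies it (the values are illustrative; the defaults already do):

```scala
import org.apache.spark.SparkConf

// spark.network.timeout is used as the executor timeout threshold, and it
// must be strictly greater than the executor heartbeat interval, or
// SparkConf validation fails with the message shown in the diff above.
val conf = new SparkConf()
  .set("spark.network.timeout", "120s")           // 120000 ms threshold
  .set("spark.executor.heartbeatInterval", "10s") // 10000 ms, well below it
```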
[spark] branch master updated (f167002 -> 7766fd1)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from f167002  [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter
 add 7766fd1  [MINOR][DOCS] Fixing log message for better clarity

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/SparkConf.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
[spark] branch branch-3.0 updated: [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new d3cc564  [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter
d3cc564 is described below

commit d3cc564de2e27d6f40d360a35ac86d17b39f1498
Author: Tom van Bussel
AuthorDate: Tue Sep 29 13:05:33 2020 +0200

    [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter

    ### What changes were proposed in this pull request?

    This PR changes `UnsafeExternalSorter` so that it no longer allocates any memory while spilling. In particular, it removes the allocation of a new pointer array in `UnsafeInMemorySorter`; the new pointer array is instead allocated whenever the next record is inserted into the sorter.

    ### Why are the changes needed?

    Without this change, `UnsafeExternalSorter` could throw an OOM while spilling. The following sequence of events would trigger an OOM:

    1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
    2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
    3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages, and to reset its `UnsafeInMemorySorter`.
    4. `UnsafeInMemorySorter` frees the old pointer array and tries to allocate a new small pointer array.
    5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array.
    6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill.
    7. `UnsafeInMemorySorter` receives less memory than it requested and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail.

    With the changes in the PR the following happens instead:

    1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one.
    2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested.
    3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages, and to reset its `UnsafeInMemorySorter`.
    4. `UnsafeInMemorySorter` frees the old pointer array.
    5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the new large array or by throwing a `SparkOutOfMemoryError`).
    6. `UnsafeExternalSorter` either frees the new large array or ignores the `SparkOutOfMemoryError`, depending on what happened in the previous step.
    7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Tests were added in `UnsafeExternalSorterSuite` and `UnsafeInMemorySorterSuite`.

    Closes #29785 from tomvanbussel/SPARK-32901.

Authored-by: Tom van Bussel
Signed-off-by: herman
(cherry picked from commit f167002522d50eefb261c8ba2d66a23b781a38c4)
Signed-off-by: herman
---
 .../unsafe/sort/UnsafeExternalSorter.java          | 96 --
 .../unsafe/sort/UnsafeInMemorySorter.java          | 55 +++--
 .../unsafe/sort/UnsafeExternalSorterSuite.java     | 46 ---
 .../unsafe/sort/UnsafeInMemorySorterSuite.java     | 40 -
 .../apache/spark/memory/TestMemoryManager.scala    |  8 ++
 5 files changed, 143 insertions(+), 102 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
index 71b9a5b..2096453 100644
--- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
+++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
@@ -203,6 +203,10 @@ public final class UnsafeExternalSorter extends MemoryConsumer {
     }

     if (inMemSorter == null || inMemSorter.numRecords() <= 0) {
+      // There could still be some memory allocated when there are no records in the in-memory
+      // sorter. We will not spill it however, to
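The pattern the PR describes, deferring the pointer-array allocation from spill time to insert time, can be summarized in a small standalone sketch. This is illustrative only (the names and structure are invented, not Spark's actual internals):

```scala
// After a spill/reset the sorter keeps no pointer array at all; the
// replacement array is allocated lazily on the next insert, when the
// memory freed by the spill is actually available again.
final class TinySorter(allocate: Int => Array[Long]) {
  private var pointers: Array[Long] = _ // null after reset()
  private var used = 0

  def reset(): Unit = { // called while spilling
    pointers = null     // free the old array; allocate nothing here
    used = 0
  }

  def insert(recordPointer: Long): Unit = {
    if (pointers == null) {
      pointers = allocate(1024)                  // deferred allocation on insert
    } else if (used == pointers.length) {
      val bigger = allocate(pointers.length * 2) // grow on demand
      Array.copy(pointers, 0, bigger, 0, used)
      pointers = bigger
    }
    pointers(used) = recordPointer
    used += 1
  }
}
```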
[spark] branch branch-3.0 updated: [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new d3cc564 [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter d3cc564 is described below commit d3cc564de2e27d6f40d360a35ac86d17b39f1498 Author: Tom van Bussel AuthorDate: Tue Sep 29 13:05:33 2020 +0200 [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter ### What changes were proposed in this pull request? This PR changes `UnsafeExternalSorter` to no longer allocate any memory while spilling. In particular it removes the allocation of a new pointer array in `UnsafeInMemorySorter`. Instead the new pointer array is allocated whenever the next record is inserted into the sorter. ### Why are the changes needed? Without this change the `UnsafeExternalSorter` could throw an OOM while spilling. The following sequence of events would have triggered an OOM: 1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one. 2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested. 3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`. 4. `UnsafeInMemorySorter` frees the old pointer array, and tries to allocate a new small pointer array. 5. `TaskMemoryManager` tries to allocate the memory backing the small array using `MemoryManager`, but `MemoryManager` is unwilling to give it any memory, as the `TaskMemoryManager` is still holding on to the memory it got for the new large array. 6. `TaskMemoryManager` again asks `UnsafeExternalSorter` to spill, but this time there is nothing to spill. 7. `UnsafeInMemorySorter` receives less memory than it requested, and causes a `SparkOutOfMemoryError` to be thrown, which causes the current task to fail. With the changes in the PR the following will happen instead: 1. `UnsafeExternalSorter` runs out of space in its pointer array and attempts to allocate a new large array to replace the old one. 2. `TaskMemoryManager` tries to allocate the memory backing the new large array using `MemoryManager`, but `MemoryManager` is only willing to return most but not all of the memory requested. 3. `TaskMemoryManager` asks `UnsafeExternalSorter` to spill, which causes `UnsafeExternalSorter` to spill the current run to disk, to free its record pages and to reset its `UnsafeInMemorySorter`. 4. `UnsafeInMemorySorter` frees the old pointer array. 5. `TaskMemoryManager` returns control to `UnsafeExternalSorter.growPointerArrayIfNecessary` (either by returning the the new large array or by throwing a `SparkOutOfMemoryError`). 6. `UnsafeExternalSorter` either frees the new large array or it ignores the `SparkOutOfMemoryError` depending on what happened in the previous step. 7. `UnsafeExternalSorter` successfully allocates a new small pointer array and operation continues as normal. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tests were added in `UnsafeExternalSorterSuite` and `UnsafeInMemorySorterSuite`. Closes #29785 from tomvanbussel/SPARK-32901. 
    Authored-by: Tom van Bussel
    Signed-off-by: herman
    (cherry picked from commit f167002522d50eefb261c8ba2d66a23b781a38c4)
    Signed-off-by: herman
---
 .../unsafe/sort/UnsafeExternalSorter.java          | 96 --
 .../unsafe/sort/UnsafeInMemorySorter.java          | 55 +++--
 .../unsafe/sort/UnsafeExternalSorterSuite.java     | 46 ---
 .../unsafe/sort/UnsafeInMemorySorterSuite.java     | 40 -
 .../apache/spark/memory/TestMemoryManager.scala    |  8 ++
 5 files changed, 143 insertions(+), 102 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
index 71b9a5b..2096453 100644
--- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
+++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java
@@ -203,6 +203,10 @@ public final class UnsafeExternalSorter extends MemoryConsumer {
     }

     if (inMemSorter == null || inMemSorter.numRecords() <= 0) {
+      // There could still be some memory allocated when there are no records in the in-memory
+      // sorter. We will not spill it however, to
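To make the fixed control flow concrete, here is a minimal, self-contained Scala sketch of the pattern the PR adopts: the spill path only releases memory and never requests any, so it cannot fail while another consumer holds a partially granted allocation; the small replacement array is allocated lazily on the next insert. All names below (`Sorter`, `insert`, `spill`) are illustrative only, not the actual Spark classes or their memory accounting.

```scala
// Illustrative sketch only -- not the real UnsafeExternalSorter API.
// Key invariant: spill() frees memory but never allocates.
object NoAllocOnSpillSketch {
  final class Sorter {
    private var pointerArray: Array[Long] = new Array[Long](4)
    private var numRecords = 0

    def spill(): Unit = {
      // ... write the current run to disk (elided) ...
      pointerArray = null // free the old array; do NOT reallocate here
      numRecords = 0
    }

    def insert(recordPointer: Long): Unit = {
      // The small replacement array is (re)allocated lazily, after any memory
      // grabbed for a failed "grow" attempt has already been released.
      if (pointerArray == null) pointerArray = new Array[Long](4)
      if (numRecords == pointerArray.length) {
        val bigger = new Array[Long](pointerArray.length * 2)
        System.arraycopy(pointerArray, 0, bigger, 0, numRecords)
        pointerArray = bigger
      }
      pointerArray(numRecords) = recordPointer
      numRecords += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val sorter = new Sorter
    (1L to 10L).foreach(sorter.insert)
    sorter.spill()     // only frees memory
    sorter.insert(42L) // the array comes back here, not inside spill()
  }
}
```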
[spark] branch master updated (90e86f6 -> f167002)
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 90e86f6  [SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for
     add f167002  [SPARK-32901][CORE] Do not allocate memory while spilling UnsafeExternalSorter

No new revisions were added by this update.

Summary of changes:
 .../unsafe/sort/UnsafeExternalSorter.java          | 96 --
 .../unsafe/sort/UnsafeInMemorySorter.java          | 55 +++--
 .../unsafe/sort/UnsafeExternalSorterSuite.java     | 46 ---
 .../unsafe/sort/UnsafeInMemorySorterSuite.java     | 40 -
 .../apache/spark/memory/TestMemoryManager.scala    |  8 ++
 5 files changed, 143 insertions(+), 102 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-33015][SQL][FOLLOWUP][3.0] Use millisToDays() in the ComputeCurrentTime rule
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 2160dc5  [SPARK-33015][SQL][FOLLOWUP][3.0] Use millisToDays() in the ComputeCurrentTime rule
2160dc5 is described below

commit 2160dc52163f017bc164ad18ca6ebe6868070402
Author: Max Gekk
AuthorDate: Tue Sep 29 19:34:43 2020 +0900

    [SPARK-33015][SQL][FOLLOWUP][3.0] Use millisToDays() in the ComputeCurrentTime rule

    ### What changes were proposed in this pull request?

    Use `millisToDays()` instead of `microsToDays()` because the latter is not available in `branch-3.0`.

    ### Why are the changes needed?

    To fix the build failure:
    ```
    [ERROR] [Error] /home/jenkins/workspace/spark-branch-3.0-maven-snapshots/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala:85: value microsToDays is not a member of object org.apache.spark.sql.catalyst.util.DateTimeUtils
    ```

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    By running `./build/sbt clean package` and `ComputeCurrentTimeSuite`.

    Closes #29901 from MaxGekk/fix-current_date-3.0.

    Authored-by: Max Gekk
    Signed-off-by: HyukjinKwon
---
 .../scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
index 09e0118..ba7e852 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
@@ -82,7 +82,7 @@ object ComputeCurrentTime extends Rule[LogicalPlan] {
     case currentDate @ CurrentDate(Some(timeZoneId)) =>
       currentDates.getOrElseUpdate(timeZoneId, {
         Literal.create(
-          DateTimeUtils.microsToDays(timestamp, currentDate.zoneId),
+          DateTimeUtils.millisToDays(DateTimeUtils.toMillis(timestamp), currentDate.zoneId),
           DateType)
       })
     case CurrentTimestamp() | Now() => currentTime
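The follow-up relies on the fact that converting microseconds to days via milliseconds lands on the same local date. A small self-contained Scala sketch of the assumed semantics follows; the real `DateTimeUtils` helpers differ in detail, and the constant, object, and method names here are illustrative only.

```scala
import java.time.{Instant, ZoneId}

object MicrosToDaysSketch {
  // Assumed semantics of DateTimeUtils.toMillis: floor division, so instants
  // before the epoch still truncate toward the past rather than toward zero.
  def toMillis(micros: Long): Long = Math.floorDiv(micros, 1000L)

  // Assumed semantics of DateTimeUtils.millisToDays: the local date the
  // instant falls on in the given zone, counted as days since 1970-01-01.
  def millisToDays(millis: Long, zoneId: ZoneId): Int =
    Instant.ofEpochMilli(millis).atZone(zoneId).toLocalDate.toEpochDay.toInt

  def main(args: Array[String]): Unit = {
    val micros = 1601337600L * 1000000L // 2020-09-29 00:00:00 UTC, in microseconds
    val zone = ZoneId.of("Asia/Seoul")  // UTC+9, so the local date is still 2020-09-29
    println(millisToDays(toMillis(micros), zone)) // prints the days-since-epoch value
  }
}
```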
[spark] branch master updated (202115e -> 90e86f6)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 202115e  [SPARK-32948][SQL] Optimize to_json and from_json expression chain
     add 90e86f6  [SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for

No new revisions were added by this update.

Summary of changes:
 .../datasources/FileSourceStrategySuite.scala      | 25 +-
 1 file changed, 15 insertions(+), 10 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 97d8634  [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py
97d8634 is described below

commit 97d8634450b39c1f4e5308b8a5308650e1e7489a
Author: HyukjinKwon
AuthorDate: Mon Sep 28 21:54:00 2020 -0700

    [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py

    Move functions-related test cases from `test_context.py` to `test_functions.py`, to group the similar test cases together. No user-facing change; this is test-only. Jenkins and GitHub Actions should test.

    Closes #29898 from HyukjinKwon/SPARK-33021.

    Authored-by: HyukjinKwon
    Signed-off-by: Dongjoon Hyun
---
 python/pyspark/sql/tests/test_context.py   | 101
 python/pyspark/sql/tests/test_functions.py | 102 -
 2 files changed, 101 insertions(+), 102 deletions(-)

diff --git a/python/pyspark/sql/tests/test_context.py b/python/pyspark/sql/tests/test_context.py
index 92e5434..3a0c7bb 100644
--- a/python/pyspark/sql/tests/test_context.py
+++ b/python/pyspark/sql/tests/test_context.py
@@ -30,7 +30,6 @@ import py4j
 from pyspark import SparkContext, SQLContext
 from pyspark.sql import Row, SparkSession
 from pyspark.sql.types import *
-from pyspark.sql.window import Window
 from pyspark.testing.utils import ReusedPySparkTestCase


@@ -112,99 +111,6 @@ class HiveContextSQLTests(ReusedPySparkTestCase):
         shutil.rmtree(tmpPath)

-    def test_window_functions(self):
-        df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
-        w = Window.partitionBy("value").orderBy("key")
-        from pyspark.sql import functions as F
-        sel = df.select(df.value, df.key,
-                        F.max("key").over(w.rowsBetween(0, 1)),
-                        F.min("key").over(w.rowsBetween(0, 1)),
-                        F.count("key").over(w.rowsBetween(float('-inf'), float('inf'))),
-                        F.row_number().over(w),
-                        F.rank().over(w),
-                        F.dense_rank().over(w),
-                        F.ntile(2).over(w))
-        rs = sorted(sel.collect())
-        expected = [
-            ("1", 1, 1, 1, 1, 1, 1, 1, 1),
-            ("2", 1, 1, 1, 3, 1, 1, 1, 1),
-            ("2", 1, 2, 1, 3, 2, 1, 1, 1),
-            ("2", 2, 2, 2, 3, 3, 3, 2, 2)
-        ]
-        for r, ex in zip(rs, expected):
-            self.assertEqual(tuple(r), ex[:len(r)])
-
-    def test_window_functions_without_partitionBy(self):
-        df = self.spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
-        w = Window.orderBy("key", df.value)
-        from pyspark.sql import functions as F
-        sel = df.select(df.value, df.key,
-                        F.max("key").over(w.rowsBetween(0, 1)),
-                        F.min("key").over(w.rowsBetween(0, 1)),
-                        F.count("key").over(w.rowsBetween(float('-inf'), float('inf'))),
-                        F.row_number().over(w),
-                        F.rank().over(w),
-                        F.dense_rank().over(w),
-                        F.ntile(2).over(w))
-        rs = sorted(sel.collect())
-        expected = [
-            ("1", 1, 1, 1, 4, 1, 1, 1, 1),
-            ("2", 1, 1, 1, 4, 2, 2, 2, 1),
-            ("2", 1, 2, 1, 4, 3, 2, 2, 2),
-            ("2", 2, 2, 2, 4, 4, 4, 3, 2)
-        ]
-        for r, ex in zip(rs, expected):
-            self.assertEqual(tuple(r), ex[:len(r)])
-
-    def test_window_functions_cumulative_sum(self):
-        df = self.spark.createDataFrame([("one", 1), ("two", 2)], ["key", "value"])
-        from pyspark.sql import functions as F
-
-        # Test cumulative sum
-        sel = df.select(
-            df.key,
-            F.sum(df.value).over(Window.rowsBetween(Window.unboundedPreceding, 0)))
-        rs = sorted(sel.collect())
-        expected = [("one", 1), ("two", 3)]
-        for r, ex in zip(rs, expected):
-            self.assertEqual(tuple(r), ex[:len(r)])
-
-        # Test boundary values less than JVM's Long.MinValue and make sure we don't overflow
-        sel = df.select(
-            df.key,
-            F.sum(df.value).over(Window.rowsBetween(Window.unboundedPreceding - 1, 0)))
-        rs = sorted(sel.collect())
-        expected = [("one", 1), ("two", 3)]
-        for r, ex in zip(rs, expected):
-            self.assertEqual(tuple(r), ex[:len(r)])
-
-        # Test boundary values greater than JVM's Long.MaxValue and make sure we don't overflow
-        frame_end = Window.unboundedFollowing + 1
-        sel = df.select(
-            df.key,
-
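The relocated tests exercise window frames whose boundaries fall outside JVM long bounds, checking that values like `Window.unboundedPreceding - 1` clamp instead of overflowing. A hedged Scala analogue of the cumulative-sum case follows, using the public Dataset API; the object name and the locally created `SparkSession` are assumptions for the sake of a runnable sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

object CumulativeSumSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("one", 1), ("two", 2)).toDF("key", "value")
    // Cumulative sum: everything from the start of the frame up to the current row.
    // Boundaries beyond Long.MinValue/MaxValue are expected to behave like the
    // unbounded markers rather than wrapping around.
    val w = Window.orderBy("key").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df.select($"key", sum($"value").over(w).as("running_total")).show()
    // Expected rows: ("one", 1), ("two", 3)
    spark.stop()
  }
}
```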
[spark] branch master updated (1b60ff5 -> 202115e)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 1b60ff5  [MINOR][DOCS] Document when `current_date` and `current_timestamp` are evaluated
     add 202115e  [SPARK-32948][SQL] Optimize to_json and from_json expression chain

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/optimizer/OptimizeJsonExprs.scala | 43 ++
 .../spark/sql/catalyst/optimizer/Optimizer.scala   |  1 +
 .../optimizer/OptimizeJsonExprsSuite.scala         | 144 +
 3 files changed, 188 insertions(+)
 create mode 100644 sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprs.scala
 create mode 100644 sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeJsonExprsSuite.scala
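The new `OptimizeJsonExprs` rule targets chains where a struct is serialized to JSON and immediately parsed back. A hedged illustration of the kind of round-trip it can simplify follows; whether the collapse fires depends on the rule's preconditions (e.g. matching schemas and default options), and the object name and local `SparkSession` here are assumptions for the sketch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, struct, to_json}
import org.apache.spark.sql.types.StructType

object JsonChainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
      .select(struct($"id", $"name").as("s"))
    val schema = df.schema("s").dataType.asInstanceOf[StructType]

    // Round-trip: to_json followed by from_json with the same schema. In builds
    // that include this rule, the optimizer can collapse the chain back to just
    // `s`, skipping the serialize/parse work; the result is identical either way.
    val roundTrip = df.select(from_json(to_json($"s"), schema).as("s"))
    roundTrip.explain(true) // inspect the optimized plan to see whether it fired
    roundTrip.show()
    spark.stop()
  }
}
```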
[spark] branch branch-3.0 updated (424f16e -> 118de10)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 424f16e  [SPARK-33015][SQL] Compute the current date only once
     add 118de10  [MINOR][DOCS] Document when `current_date` and `current_timestamp` are evaluated

No new revisions were added by this update.

Summary of changes:
 R/pkg/R/functions.R                                          |  6 --
 python/pyspark/sql/functions.py                              |  6 --
 .../spark/sql/catalyst/expressions/datetimeExpressions.scala | 12 ++--
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala |  6 --
 4 files changed, 18 insertions(+), 12 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (6868b40 -> 1b60ff5)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 6868b40  [SPARK-33020][PYTHON] Add nth_value as a PySpark function
     add 1b60ff5  [MINOR][DOCS] Document when `current_date` and `current_timestamp` are evaluated

No new revisions were added by this update.

Summary of changes:
 R/pkg/R/functions.R                                          |  6 --
 python/pyspark/sql/functions.py                              |  6 --
 .../spark/sql/catalyst/expressions/datetimeExpressions.scala | 12 ++--
 sql/core/src/main/scala/org/apache/spark/sql/functions.scala |  6 --
 4 files changed, 18 insertions(+), 12 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
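For context, the behavior these doc changes spell out: `current_date()` and `current_timestamp()` are evaluated once at the start of query evaluation, so every row produced by a single query sees the same value. A small PySpark sketch of the observable effect; the local session is illustrative only:

```python
# current_timestamp() is evaluated at query start, not per row, so a single
# query yields exactly one distinct timestamp across all of its rows.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
distinct_ts = spark.range(3).select(F.current_timestamp().alias("ts")).distinct()
assert distinct_ts.count() == 1
```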
[spark] branch master updated (68cd567 -> 6868b40)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from 68cd567  [SPARK-33015][SQL] Compute the current date only once
     add 6868b40  [SPARK-33020][PYTHON] Add nth_value as a PySpark function

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.sql.rst |  1 +
 python/pyspark/sql/functions.py              | 20
 python/pyspark/sql/functions.pyi             |  3 +++
 python/pyspark/sql/tests/test_functions.py   | 34
 4 files changed, 58 insertions(+)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
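A minimal usage sketch for the newly added function, exposed as `pyspark.sql.functions.nth_value` on master (Spark 3.1+); the data and window spec are invented for illustration:

```python
# nth_value returns the value of the column at the given ordinal within the
# window frame, and null while the frame holds fewer than that many rows.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3), ("b", 8)], ["grp", "x"])

w = Window.partitionBy("grp").orderBy("x")
df.select("grp", "x", F.nth_value("x", 2).over(w).alias("second_x")).show()
# grp=a rows: second_x is null for x=1, then 2 for x=2 and x=3;
# grp=b has a single row, so second_x stays null.
```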