[spark] branch master updated (b1adc3d -> 88a4e55)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from b1adc3d  [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH_BUCKET
     add 88a4e55  [SPARK-31765][WEBUI][TEST-MAVEN] Upgrade HtmlUnit >= 2.37.0

No new revisions were added by this update.

Summary of changes:
 core/pom.xml                                              |  2 +-
 core/src/main/scala/org/apache/spark/ui/JettyUtils.scala  |  7 ++-
 .../test/scala/org/apache/spark/ui/UISeleniumSuite.scala  |  2 +-
 pom.xml                                                   | 14 +-
 sql/core/pom.xml                                          |  2 +-
 sql/hive-thriftserver/pom.xml                             |  2 +-
 streaming/pom.xml                                         |  2 +-
 7 files changed, 20 insertions(+), 11 deletions(-)

---
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (ca2cfd4 -> 6befb2d)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from ca2cfd4  [SPARK-31906][SQL][DOCS] Enhance comments in NamedExpression.qualifier
     add 6befb2d  [SPARK-31486][CORE] spark.submit.waitAppCompletion flag to control spark-submit exit in Standalone Cluster Mode

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/Client.scala | 95 --
 docs/spark-standalone.md                       | 19 +
 2 files changed, 88 insertions(+), 26 deletions(-)
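[Editor's note: the `spark.submit.waitAppCompletion` flag added by SPARK-31486 is passed like any other Spark configuration property. A sketch of a possible invocation; the master URL and application JAR path are placeholders, not from the commit:

```shell
# Hypothetical example: in standalone cluster mode, with
# spark.submit.waitAppCompletion=true, spark-submit blocks until the
# driver finishes rather than exiting as soon as the driver is submitted.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --conf spark.submit.waitAppCompletion=true \
  path/to/app.jar
```
]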
[spark] branch branch-3.0 updated: [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 3bb0824  [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
3bb0824 is described below

commit 3bb08245a9c5e9f9da6f48e65e7dc0f80ad96b4a
Author: Enrico Minack
AuthorDate: Wed Jun 3 18:06:13 2020 -0500

    [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide

    ### What changes were proposed in this pull request?
    The PySpark Migration Guide needs to mention a breaking change of the PySpark ML API.

    ### Why are the changes needed?
    In SPARK-29093, all setters were removed from the `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public PySpark ML API, hence this is a breaking change.

    ### Does this PR introduce _any_ user-facing change?
    Only documentation.

    ### How was this patch tested?
    Visually.

    Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

    Authored-by: Enrico Minack
    Signed-off-by: Sean Owen
    (cherry picked from commit 4bbe3c2bb49030256d7e4f6941dd5629ee6d5b66)
    Signed-off-by: Sean Owen
---
 docs/pyspark-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/pyspark-migration-guide.md b/docs/pyspark-migration-guide.md
index 6f0fbbf..2c9ea41 100644
--- a/docs/pyspark-migration-guide.md
+++ b/docs/pyspark-migration-guide.md
@@ -45,6 +45,8 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
 - As of Spark 3.0, `Row` field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered.
   To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to `true` for both executors and driver - this environment variable must be consistent on all executors and driver; otherwise, it may cause failures or incorrect answers. For [...]
+- In Spark 3.0, `pyspark.ml.param.shared.Has*` mixins do not provide any `set*(self, value)` setter methods anymore, use the respective `self.set(self.*, value)` instead. See [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093) for details.
+
 ## Upgrading from PySpark 2.3 to 2.4
 - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
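[Editor's note: the setter migration described in this commit can be illustrated with a minimal sketch. The `Params`/`HasThreshold` classes below are stand-ins that mimic the shape of `pyspark.ml.param`, not the real PySpark classes; only the `obj.set(obj.param, value)` call pattern reflects the documented change:

```python
# Minimal stand-ins mimicking the pyspark.ml.param API shape (not real
# pyspark code), to show the Spark 3.0 setter migration pattern.
class Params:
    """Tiny stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._paramMap = {}

    def set(self, param, value):
        # The generic setter that remains available in Spark 3.0.
        self._paramMap[param] = value
        return self

    def getOrDefault(self, param):
        return self._paramMap[param]


class HasThreshold(Params):
    # Real pyspark uses a Param object here; a plain string suffices
    # for this sketch.
    threshold = "threshold"

    def getThreshold(self):
        return self.getOrDefault(self.threshold)


class MyModel(HasThreshold):
    pass


model = MyModel()
# Spark 2.4 style, removed in 3.0:  model.setThreshold(0.5)
# Spark 3.0 style:
model.set(model.threshold, 0.5)
print(model.getThreshold())  # -> 0.5
```
]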
[spark] branch master updated (dc0709f -> 4bbe3c2)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from dc0709f  [SPARK-29947][SQL][FOLLOWUP] ResolveRelations should return relations with fresh attribute IDs
     add 4bbe3c2  [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide

No new revisions were added by this update.

Summary of changes:
 docs/pyspark-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated: [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4bbe3c2  [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
4bbe3c2 is described below

commit 4bbe3c2bb49030256d7e4f6941dd5629ee6d5b66
Author: Enrico Minack
AuthorDate: Wed Jun 3 18:06:13 2020 -0500

    [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide

    ### What changes were proposed in this pull request?
    The PySpark Migration Guide needs to mention a breaking change of the PySpark ML API.

    ### Why are the changes needed?
    In SPARK-29093, all setters were removed from the `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public PySpark ML API, hence this is a breaking change.

    ### Does this PR introduce _any_ user-facing change?
    Only documentation.

    ### How was this patch tested?
    Visually.

    Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

    Authored-by: Enrico Minack
    Signed-off-by: Sean Owen
---
 docs/pyspark-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/pyspark-migration-guide.md b/docs/pyspark-migration-guide.md
index 6f0fbbf..2c9ea41 100644
--- a/docs/pyspark-migration-guide.md
+++ b/docs/pyspark-migration-guide.md
@@ -45,6 +45,8 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
 - As of Spark 3.0, `Row` field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered.
   To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to `true` for both executors and driver - this environment variable must be consistent on all executors and driver; otherwise, it may cause failures or incorrect answers. For [...]
+- In Spark 3.0, `pyspark.ml.param.shared.Has*` mixins do not provide any `set*(self, value)` setter methods anymore, use the respective `self.set(self.*, value)` instead. See [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093) for details.
+
 ## Upgrading from PySpark 2.3 to 2.4
 - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
[spark] branch master updated (d79a8a8 -> e5c3463)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from d79a8a8  [SPARK-31834][SQL] Improve error message for incompatible data types
     add e5c3463  [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0

No new revisions were added by this update.

Summary of changes:
 core/pom.xml                                                  |  2 +-
 core/src/main/scala/org/apache/spark/ui/JettyUtils.scala      |  7 ++-
 core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala |  2 +-
 pom.xml                                                       | 10 +-
 sql/core/pom.xml                                              |  2 +-
 sql/hive-thriftserver/pom.xml                                 |  2 +-
 streaming/pom.xml                                             |  2 +-
 7 files changed, 16 insertions(+), 11 deletions(-)
[spark] branch master updated (e70df2c -> 6a895d0)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from e70df2c  [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
     add 6a895d0  [SPARK-31804][WEBUI] Add real headless browser support for HistoryServer tests

No new revisions were added by this update.

Summary of changes:
 .../history/ChromeUIHistoryServerSuite.scala}      |   7 +-
 .../spark/deploy/history/HistoryServerSuite.scala  |  62 -
 .../history/RealBrowserUIHistoryServerSuite.scala  | 155 +
 3 files changed, 159 insertions(+), 65 deletions(-)
 copy core/src/test/scala/org/apache/spark/{ui/ChromeUISeleniumSuite.scala => deploy/history/ChromeUIHistoryServerSuite.scala} (88%)
 create mode 100644 core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala
[spark] branch master updated (bc24c99 -> e70df2c)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from bc24c99  [SPARK-31837][CORE] Shift to the new highest locality level if there is when recomputeLocality
     add e70df2c  [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)
[spark] branch master updated: [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e70df2c  [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

e70df2c is described below

commit e70df2cea46f71461d8d401a420e946f999862c1
Author: Yuexin Zhang
AuthorDate: Mon Jun 1 09:46:18 2020 -0500

    [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

    ### What changes were proposed in this pull request?

    Improve the check logic that decides whether all node managers are really blacklisted.

    ### Why are the changes needed?

    I observed that when the AM is out of sync with the ResourceManager, or the RM has trouble reporting back the current number of available NMs, something like the following happens:

    20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
    20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.

    The Spark job then suddenly runs into the AllNodeBlacklisted state:

    20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)

    However, there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and no blacklisting message appears from:
    https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

    We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND currentBlacklistedYarnNodes.size >= numClusterNodes.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    A minor change. No changes on tests.

    Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.

    Authored-by: Yuexin Zhang
    Signed-off-by: Sean Owen
---
 .../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
index fa8c961..339d371 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
     refreshBlacklistedNodes()
   }
 
-  def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+  def isAllNodeBlacklisted: Boolean = {
+    if (numClusterNodes <= 0) {
+      logWarning("No available nodes reported, please check Resource Manager.")
+      false
+    } else {
+      currentBlacklistedYarnNodes.size >= numClusterNodes
+    }
+  }
 
   private def refreshBlacklistedNodes(): Unit = {
     removeExpiredYarnBlacklistedNodes()

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
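[Editor's note: in isolation, the guarded check in this commit behaves as in the following Python sketch. The names `is_all_node_blacklisted`, `blacklisted_count`, and `num_cluster_nodes` are hypothetical stand-ins for the tracker's `isAllNodeBlacklisted`, `currentBlacklistedYarnNodes.size`, and `numClusterNodes`; this is not Spark code.]

```python
def is_all_node_blacklisted(blacklisted_count: int, num_cluster_nodes: int) -> bool:
    """Sketch of the fixed check: never report 'all nodes blacklisted'
    when the ResourceManager has reported zero (or unknown) cluster nodes."""
    if num_cluster_nodes <= 0:
        # The Scala code logs a warning here:
        # "No available nodes reported, please check Resource Manager."
        return False
    return blacklisted_count >= num_cluster_nodes

# Before the fix, the bare comparison `size >= numClusterNodes` was used,
# so when the RM reported 0 nodes, 0 >= 0 was true and the app failed with
# "Due to executor failures all available nodes are blacklisted".
assert is_all_node_blacklisted(0, 0) is False   # RM out of sync: not a failure
assert is_all_node_blacklisted(3, 3) is True    # genuinely all blacklisted
assert is_all_node_blacklisted(2, 3) is False
```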
[spark] branch master updated (47dc332 -> 45cf5e9)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 47dc332  [SPARK-31874][SQL] Use `FastDateFormat` as the legacy fractional formatter
  add 45cf5e9  [SPARK-31840][ML] Add instance weight support in LogisticRegressionSummary

No new revisions were added by this update.

Summary of changes:
 .../ml/classification/LogisticRegression.scala     | 99 +-
 .../classification/LogisticRegressionSuite.scala   | 61 +
 project/MimaExcludes.scala                         |  6 +-
 python/pyspark/ml/classification.py                | 11 +++
 4 files changed, 134 insertions(+), 43 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 5fa46eb  [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference

5fa46eb is described below

commit 5fa46eb3d50281943a446e6d10fc7c6621c011cd
Author: Huaxin Gao
AuthorDate: Sat May 30 14:51:45 2020 -0500

    [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference

    Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference, to make the SQL reference complete.

    Screenshots of the changed pages:
    https://user-images.githubusercontent.com/13592258/83316782-d6fcf300-a1dc-11ea-87f6-e357b9c739fd.png
    https://user-images.githubusercontent.com/13592258/83316784-d8c6b680-a1dc-11ea-95ea-10a1f75dcef9.png

    Only the above pages are changed. The following two pages are the same as before:
    https://user-images.githubusercontent.com/13592258/83223474-bfb3fc00-a12f-11ea-807a-824a618afa0b.png
    https://user-images.githubusercontent.com/13592258/83223478-c2165600-a12f-11ea-806e-a1e57dc35ef4.png

    Manually build and check

    Closes #28672 from huaxingao/coalesce_hint.

    Authored-by: Huaxin Gao
    Signed-off-by: Sean Owen
    (cherry picked from commit 1b780f364bfbb46944fe805a024bb6c32f5d2dde)
    Signed-off-by: Sean Owen
---
 docs/_data/menu-sql.yaml                           |  8 +--
 docs/sql-performance-tuning.md                     |  4 ++
 docs/sql-ref-syntax-qry-select-hints.md            | 83 --
 docs/sql-ref-syntax-qry-select-join.md             |  2 +-
 ...ng.md => sql-ref-syntax-qry-select-sampling.md} |  0
 ...ndow.md => sql-ref-syntax-qry-select-window.md} |  0
 docs/sql-ref-syntax-qry-select.md                  |  6 +-
 docs/sql-ref-syntax-qry.md                         |  6 +-
 docs/sql-ref-syntax.md                             |  6 +-
 9 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index 289a9d3..219e680 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -171,22 +171,22 @@
       url: sql-ref-syntax-qry-select-limit.html
     - text: Common Table Expression
       url: sql-ref-syntax-qry-select-cte.html
+    - text: Hints
+      url: sql-ref-syntax-qry-select-hints.html
     - text: Inline Table
       url: sql-ref-syntax-qry-select-inline-table.html
     - text: JOIN
       url: sql-ref-syntax-qry-select-join.html
-    - text: Join Hints
-      url: sql-ref-syntax-qry-select-hints.html
     - text: LIKE Predicate
       url: sql-ref-syntax-qry-select-like.html
     - text: Set Operators
       url: sql-ref-syntax-qry-select-setops.html
     - text: TABLESAMPLE
-      url: sql-ref-syntax-qry-sampling.html
+      url: sql-ref-syntax-qry-select-sampling.html
     - text: Table-valued Function
       url: sql-ref-syntax-qry-select-tvf.html
     - text: Window Function
-      url: sql-ref-syntax-qry-window.html
+      url: sql-ref-syntax-qry-select-window.html
     - text: EXPLAIN
       url: sql-ref-syntax-qry-explain.html
     - text: Auxiliary Statements

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 5b784a5..5e6f049 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -179,6 +179,8 @@
 SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key
 
+For more details please refer to the documentation of [Join Hints](sql-ref-syntax-qry-select-hints.html#join-hints).
+
 ## Coalesce Hints for SQL Queries
 
 Coalesce hints allows the Spark SQL users to control the number of output files just like the
@@ -194,6 +196,8 @@
 SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
 SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
 
+For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
+
 ## Adaptive Query Execution
 
 Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join [...]
[spark] branch master updated (b9737c3 -> 1b780f3)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from b9737c3  [SPARK-31864][SQL] Adjust AQE skew join trigger condition
  add 1b780f3  [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/_data/menu-sql.yaml                           |  8 +--
 docs/sql-performance-tuning.md                     |  4 +-
 docs/sql-ref-syntax-qry-select-hints.md            | 83 --
 docs/sql-ref-syntax-qry-select-join.md             |  2 +-
 ...ng.md => sql-ref-syntax-qry-select-sampling.md} |  0
 ...ndow.md => sql-ref-syntax-qry-select-window.md} |  0
 docs/sql-ref-syntax-qry-select.md                  |  6 +-
 docs/sql-ref-syntax-qry.md                         |  6 +-
 docs/sql-ref-syntax.md                             |  6 +-
 9 files changed, 94 insertions(+), 21 deletions(-)
 rename docs/{sql-ref-syntax-qry-sampling.md => sql-ref-syntax-qry-select-sampling.md} (100%)
 rename docs/{sql-ref-syntax-qry-window.md => sql-ref-syntax-qry-select-window.md} (100%)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (765105b -> 50492c0)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 765105b  [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages
  add 50492c0  [SPARK-31803][ML] Make sure instance weight is not negative

No new revisions were added by this update.

Summary of changes:
 mllib/src/main/scala/org/apache/spark/ml/Predictor.scala           | 3 ++-
 .../main/scala/org/apache/spark/ml/classification/NaiveBayes.scala | 5 +++--
 .../scala/org/apache/spark/ml/clustering/BisectingKMeans.scala     | 3 ++-
 .../scala/org/apache/spark/ml/clustering/GaussianMixture.scala     | 3 ++-
 mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala   | 3 ++-
 .../apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala | 3 ++-
 .../scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala | 3 ++-
 .../scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala   | 2 --
 .../spark/ml/evaluation/MulticlassClassificationEvaluator.scala    | 3 ++-
 .../scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala | 4 +++-
 mllib/src/main/scala/org/apache/spark/ml/functions.scala           | 6 ++
 .../apache/spark/ml/regression/GeneralizedLinearRegression.scala   | 3 ++-
 .../scala/org/apache/spark/ml/regression/IsotonicRegression.scala  | 7 ---
 13 files changed, 32 insertions(+), 16 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (8f2b6f3 -> 765105b)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 8f2b6f3  [SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression
  add 765105b  [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/ui/jobs/AllJobsPage.scala     |  23 +--
 .../scala/org/apache/spark/ui/jobs/StagePage.scala |  14 +-
 .../org/apache/spark/ui/jobs/StageTable.scala      |  21 +--
 .../scala/org/apache/spark/ui/StagePageSuite.scala |   1 -
 .../spark/sql/execution/ui/AllExecutionsPage.scala |  29 +---
 .../hive/thriftserver/ui/ThriftServerPage.scala    | 164 +
 6 files changed, 93 insertions(+), 159 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (7f36310 -> d400777)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 7f36310  [SPARK-31802][SQL] Format Java date-time types in `Row.jsonValue` directly
  add d400777  [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator

No new revisions were added by this update.

Summary of changes:
 .../spark/ml/evaluation/ClusteringEvaluator.scala  |  34 --
 .../spark/ml/evaluation/ClusteringMetrics.scala    | 128 -
 .../ml/evaluation/ClusteringEvaluatorSuite.scala   |  43 ++-
 python/pyspark/ml/evaluation.py                    |  29 -
 4 files changed, 167 insertions(+), 67 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d400777  [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator

d400777 is described below

commit d4007776f2dd85f03f3811ab8ca711f221f62c00
Author: Huaxin Gao
AuthorDate: Mon May 25 09:18:08 2020 -0500

    [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator

    ### What changes were proposed in this pull request?

    Add weight support in ClusteringEvaluator.

    ### Why are the changes needed?

    Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.

    ### Does this PR introduce _any_ user-facing change?

    Yes. ClusteringEvaluator.setWeightCol

    ### How was this patch tested?

    add new unit test

    Closes #28553 from huaxingao/weight_evaluator.

    Authored-by: Huaxin Gao
    Signed-off-by: Sean Owen
---
 .../spark/ml/evaluation/ClusteringEvaluator.scala  |  34 --
 .../spark/ml/evaluation/ClusteringMetrics.scala    | 128 -
 .../ml/evaluation/ClusteringEvaluatorSuite.scala   |  43 ++-
 python/pyspark/ml/evaluation.py                    |  29 -
 4 files changed, 167 insertions(+), 67 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
index 63b99a0..19790fd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
@@ -19,10 +19,11 @@ package org.apache.spark.ml.evaluation
 
 import org.apache.spark.annotation.Since
 import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
-import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol, HasWeightCol}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.Dataset
-import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.DoubleType
 
 /**
  * Evaluator for clustering results.
@@ -34,7 +35,8 @@ import org.apache.spark.sql.functions.col
  */
 @Since("2.3.0")
 class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
-  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with HasWeightCol
+    with DefaultParamsWritable {
 
   @Since("2.3.0")
   def this() = this(Identifiable.randomUID("cluEval"))
@@ -53,6 +55,10 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
   @Since("2.3.0")
   def setFeaturesCol(value: String): this.type = set(featuresCol, value)
 
+  /** @group setParam */
+  @Since("3.1.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
   /**
    * param for metric name in evaluation
    * (supports `"silhouette"` (default))
@@ -116,12 +122,26 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
    */
   @Since("3.1.0")
   def getMetrics(dataset: Dataset[_]): ClusteringMetrics = {
-    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
-    SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
+    val schema = dataset.schema
+    SchemaUtils.validateVectorCompatibleColumn(schema, $(featuresCol))
+    SchemaUtils.checkNumericType(schema, $(predictionCol))
+    if (isDefined(weightCol)) {
+      SchemaUtils.checkNumericType(schema, $(weightCol))
+    }
+
+    val weightColName = if (!isDefined(weightCol)) "weightCol" else $(weightCol)
 
     val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
-    val df = dataset.select(col($(predictionCol)),
-      vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata))
+    val df = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
+      dataset.select(col($(predictionCol)),
+        vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata),
+        lit(1.0).as(weightColName))
+    } else {
+      dataset.select(col($(predictionCol)),
+        vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata),
+        col(weightColName).cast(DoubleType))
+    }
+
     val metrics = new ClusteringMetrics(df)
     metrics.setDistanceMeasure($(distanceMeasure))
     metrics

diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/Clust
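[Editor's note: the weight-handling pattern in `getMetrics` above — default every row to weight 1.0 when no weight column is set, otherwise use the user's column cast to double — can be sketched standalone as follows. This is a hedged Python illustration with hypothetical names (`resolve_weights`, plain dicts for rows), not the actual Spark DataFrame API.]

```python
def resolve_weights(rows, weight_col=None):
    """Mimic the getMetrics logic: rows are dicts. If weight_col is unset
    or empty, every row gets weight 1.0; otherwise the named column is
    used, cast to float (the Scala code casts to DoubleType)."""
    if not weight_col:
        # Corresponds to: lit(1.0).as(weightColName)
        return [1.0 for _ in rows]
    # Corresponds to: col(weightColName).cast(DoubleType)
    return [float(r[weight_col]) for r in rows]

rows = [{"prediction": 0, "w": "2"}, {"prediction": 1, "w": "0.5"}]
assert resolve_weights(rows) == [1.0, 1.0]       # no weight column set
assert resolve_weights(rows, "w") == [2.0, 0.5]  # user weights, cast to double
```

Per the PR description, the actual user-facing entry point is `ClusteringEvaluator.setWeightCol`, with a matching `weightCol` parameter added on the Python side in `python/pyspark/ml/evaluation.py`.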
[spark] branch master updated (cf7463f -> d0fe433)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from cf7463f  [SPARK-31761][SQL] cast integer to Long to avoid IntegerOverflow for IntegralDivide operator
  add d0fe433  [SPARK-31768][ML] add getMetrics in Evaluators

No new revisions were added by this update.

Summary of changes:
 .../evaluation/BinaryClassificationEvaluator.scala | 26 +-
 .../spark/ml/evaluation/ClusteringEvaluator.scala  | 559 +
 ...ringEvaluator.scala => ClusteringMetrics.scala} | 173 ++-
 .../MulticlassClassificationEvaluator.scala        | 54 +-
 .../MultilabelClassificationEvaluator.scala        | 36 +-
 .../spark/ml/evaluation/RankingEvaluator.scala     | 28 +-
 .../spark/ml/evaluation/RegressionEvaluator.scala  | 28 +-
 .../spark/mllib/evaluation/MulticlassMetrics.scala |  3 +-
 .../BinaryClassificationEvaluatorSuite.scala       | 23 +
 .../ml/evaluation/ClusteringEvaluatorSuite.scala   | 16 +
 .../MulticlassClassificationEvaluatorSuite.scala   | 29 ++
 .../MultilabelClassificationEvaluatorSuite.scala   | 48 ++
 .../ml/evaluation/RankingEvaluatorSuite.scala      | 38 ++
 .../ml/evaluation/RegressionEvaluatorSuite.scala   | 33 ++
 14 files changed, 375 insertions(+), 719 deletions(-)
 copy mllib/src/main/scala/org/apache/spark/ml/evaluation/{ClusteringEvaluator.scala => ClusteringMetrics.scala} (80%)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (892b600 -> d955708)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 892b600  [SPARK-31790][DOCS] cast(long as timestamp) show different result between Hive and Spark
  add d955708  [SPARK-31756][WEBUI] Add real headless browser support for UI test

No new revisions were added by this update.

Summary of changes:
 .../tags/{DockerTest.java => ChromeUITest.java}    |   3 +-
 .../apache/spark/ui/ChromeUISeleniumSuite.scala    |  29 +++---
 .../spark/ui/RealBrowserUISeleniumSuite.scala      | 109 +
 .../org/apache/spark/ui/UISeleniumSuite.scala      |  27 -
 dev/run-tests.py                                   |   5 +
 pom.xml                                            |   2 +
 6 files changed, 132 insertions(+), 43 deletions(-)
 copy common/tags/src/test/java/org/apache/spark/tags/{DockerTest.java => ChromeUITest.java} (96%)
 copy mllib/src/test/scala/org/apache/spark/ml/util/TempDirectory.scala => core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala (62%)
 create mode 100644 core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (d2bec5e -> 097d509)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d2bec5e [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax add 097d509 [MINOR] Fix a typo in FsHistoryProvider loginfo No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [MINOR] Fix a typo in FsHistoryProvider loginfo
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 097d509 [MINOR] Fix a typo in FsHistoryProvider loginfo 097d509 is described below commit 097d5098cca987e5f7bbb8394783c01517ebed0f Author: Sungpeo Kook AuthorDate: Sun May 17 09:43:01 2020 -0500 [MINOR] Fix a typo in FsHistoryProvider loginfo ## What changes were proposed in this pull request? a typo in logging. (just added `: `) Closes #28505 from sungpeo/typo_fshistoryprovider. Authored-by: Sungpeo Kook Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index 99d3ece..25ea75a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -108,7 +108,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) private val historyUiAdminAclsGroups = conf.get(History.HISTORY_SERVER_UI_ADMIN_ACLS_GROUPS) logInfo(s"History server ui acls " + (if (historyUiAclsEnable) "enabled" else "disabled") + "; users with admin permissions: " + historyUiAdminAcls.mkString(",") + -"; groups with admin permissions" + historyUiAdminAclsGroups.mkString(",")) +"; groups with admin permissions: " + historyUiAdminAclsGroups.mkString(",")) private val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf) // Visible for testing - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (5d90886 -> 194ac3b)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 5d90886 [SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite add 194ac3b [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector No new revisions were added by this update. Summary of changes: docs/ml-features.md| 140 + docs/ml-statistics.md | 56 - ...rExample.java => JavaANOVASelectorExample.java} | 35 +++--- .../spark/examples/ml/JavaANOVATestExample.java| 2 +- ...Example.java => JavaFValueSelectorExample.java} | 34 ++--- .../spark/examples/ml/JavaFValueTestExample.java | 2 +- ...lector_example.py => anova_selector_example.py} | 24 ++-- examples/src/main/python/ml/anova_test_example.py | 2 +- ...ector_example.py => fvalue_selector_example.py} | 26 ++-- examples/src/main/python/ml/fvalue_test_example.py | 2 +- ...torExample.scala => ANOVASelectorExample.scala} | 30 +++-- .../spark/examples/ml/ANOVATestExample.scala | 2 +- ...orExample.scala => FValueSelectorExample.scala} | 30 +++-- ...ueTestExample.scala => FValueTestExample.scala} | 0 14 files changed, 308 insertions(+), 77 deletions(-) copy examples/src/main/java/org/apache/spark/examples/ml/{JavaChiSqSelectorExample.java => JavaANOVASelectorExample.java} (66%) copy examples/src/main/java/org/apache/spark/examples/ml/{JavaVarianceThresholdSelectorExample.java => JavaFValueSelectorExample.java} (76%) copy examples/src/main/python/ml/{chisq_selector_example.py => anova_selector_example.py} (62%) copy examples/src/main/python/ml/{variance_threshold_selector_example.py => fvalue_selector_example.py} (58%) copy examples/src/main/scala/org/apache/spark/examples/ml/{ChiSqSelectorExample.scala => ANOVASelectorExample.scala} (64%) copy examples/src/main/scala/org/apache/spark/examples/ml/{ChiSqSelectorExample.scala => FValueSelectorExample.scala} (62%) rename 
examples/src/main/scala/org/apache/spark/examples/ml/{FVlaueTestExample.scala => FValueTestExample.scala} (100%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 194ac3b [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector 194ac3b is described below commit 194ac3be8bd8ca1b5e463074ed61420f185e8caf Author: Huaxin Gao AuthorDate: Fri May 15 09:59:14 2020 -0500 [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector ### What changes were proposed in this pull request? Add docs and examples for ANOVASelector and FValueSelector ### Why are the changes needed? Complete the implementation of ANOVASelector and FValueSelector ### Does this PR introduce _any_ user-facing change? Yes (screenshots of the updated docs pages omitted)
### How was this patch tested? Manually build and check Closes #28524 from huaxingao/examples. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- docs/ml-features.md| 140 + docs/ml-statistics.md | 56 - ...tExample.java => JavaANOVASelectorExample.java} | 48 +++ .../spark/examples/ml/JavaANOVATestExample.java| 2 +- ...Example.java => JavaFValueSelectorExample.java} | 48 +++ .../spark/examples/ml/JavaFValueTestExample.java | 2 +- ...a_test_example.py => anova_selector_example.py} | 35 +++--- examples/src/main/python/ml/anova_test_example.py | 2 +- ..._test_example.py => fvalue_selector_example.py} | 35 +++--- examples/src/main/python/ml/fvalue_test_example.py | 2 +- ...estExample.scala => ANOVASelectorExample.scala} | 42 --- .../spark/examples/ml/ANOVATestExample.scala | 2 +- ...stExample.scala => FValueSelectorExample.scala} | 42 --- ...ueTestExample.scala => FValueTestExample.scala} | 0 14 files changed, 340 insertions(+), 116 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 65b60be..660c272 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1793,6 +1793,146 @@ for more details on the API. +## ANOVASelector + +`ANOVASelector` operates on categorical labels with continuous features. It uses the +[one-way ANOVA F-test](https://en.wikipedia.org/wiki/F-test#Multiple-comparison_ANOVA_problems) to decide which +features to choose.
+It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: +* `numTopFeatures` chooses a fixed number of top features according to ANOVA F-test. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values a
[spark] branch branch-3.0 updated: [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 6834f46 [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary 6834f46 is described below commit 6834f4691b3e2603d410bfe24f0db0b6e3a36446 Author: Huaxin Gao AuthorDate: Thu May 14 10:54:35 2020 -0500 [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary ### What changes were proposed in this pull request? Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark ### Why are the changes needed? Currently we have ``` since("2.0.0") def evaluate(self, dataset): if not isinstance(dataset, DataFrame): raise ValueError("dataset must be a DataFrame but got %s." % type(dataset)) java_blr_summary = self._call_java("evaluate", dataset) return BinaryLogisticRegressionSummary(java_blr_summary) ``` we should return LogisticRegressionSummary for multiclass logistic regression ### Does this PR introduce _any_ user-facing change? Yes return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python ### How was this patch tested? unit test Closes #28503 from huaxingao/lr_summary. 
Authored-by: Huaxin Gao Signed-off-by: Sean Owen (cherry picked from commit e10516ae63cfc58f2d493e4d3f19940d45c8f033) Signed-off-by: Sean Owen --- python/pyspark/ml/classification.py | 5 - python/pyspark/ml/tests/test_training_summary.py | 6 +- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 1436b78..424e16a 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -831,7 +831,10 @@ class LogisticRegressionModel(JavaProbabilisticClassificationModel, _LogisticReg if not isinstance(dataset, DataFrame): raise ValueError("dataset must be a DataFrame but got %s." % type(dataset)) java_blr_summary = self._call_java("evaluate", dataset) -return BinaryLogisticRegressionSummary(java_blr_summary) +if self.numClasses <= 2: +return BinaryLogisticRegressionSummary(java_blr_summary) +else: +return LogisticRegressionSummary(java_blr_summary) class LogisticRegressionSummary(JavaWrapper): diff --git a/python/pyspark/ml/tests/test_training_summary.py b/python/pyspark/ml/tests/test_training_summary.py index 1d19ebf..b505409 100644 --- a/python/pyspark/ml/tests/test_training_summary.py +++ b/python/pyspark/ml/tests/test_training_summary.py @@ -21,7 +21,8 @@ import unittest if sys.version > '3': basestring = str -from pyspark.ml.classification import LogisticRegression +from pyspark.ml.classification import BinaryLogisticRegressionSummary, LogisticRegression, \ +LogisticRegressionSummary from pyspark.ml.clustering import BisectingKMeans, GaussianMixture, KMeans from pyspark.ml.linalg import Vectors from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression @@ -149,6 +150,7 @@ class TrainingSummaryTest(SparkSessionTestCase): # test evaluation (with training dataset) produces a summary with same values # one check is enough to verify a summary is returned, Scala version runs full test sameSummary = model.evaluate(df) 
+self.assertTrue(isinstance(sameSummary, BinaryLogisticRegressionSummary)) self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC) def test_multiclass_logistic_regression_summary(self): @@ -187,6 +189,8 @@ class TrainingSummaryTest(SparkSessionTestCase): # test evaluation (with training dataset) produces a summary with same values # one check is enough to verify a summary is returned, Scala version runs full test sameSummary = model.evaluate(df) +self.assertTrue(isinstance(sameSummary, LogisticRegressionSummary)) +self.assertFalse(isinstance(sameSummary, BinaryLogisticRegressionSummary)) self.assertAlmostEqual(sameSummary.accuracy, s.accuracy) def test_gaussian_mixture_summary(self): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
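The patch above makes `evaluate` branch on `numClasses`, so binary models keep getting the richer `BinaryLogisticRegressionSummary` while multiclass models now get the base `LogisticRegressionSummary`. A minimal sketch of that dispatch, using stub classes rather than the real `pyspark.ml` ones (no Spark session required):

```python
# Stub stand-ins for the pyspark.ml classes; only the dispatch logic
# from the SPARK-31681 patch is reproduced here.
class LogisticRegressionSummary:
    def __init__(self, java_summary):
        self.java_summary = java_summary

class BinaryLogisticRegressionSummary(LogisticRegressionSummary):
    pass

class LogisticRegressionModel:
    def __init__(self, num_classes):
        self.numClasses = num_classes

    def evaluate(self, dataset):
        # Stands in for self._call_java("evaluate", dataset).
        java_blr_summary = dataset
        # The fix: only wrap in the binary summary for <= 2 classes.
        if self.numClasses <= 2:
            return BinaryLogisticRegressionSummary(java_blr_summary)
        return LogisticRegressionSummary(java_blr_summary)

binary = LogisticRegressionModel(num_classes=2).evaluate(object())
multi = LogisticRegressionModel(num_classes=3).evaluate(object())
print(isinstance(binary, BinaryLogisticRegressionSummary))  # True
print(isinstance(multi, BinaryLogisticRegressionSummary))   # False
print(isinstance(multi, LogisticRegressionSummary))         # True
```

Because `BinaryLogisticRegressionSummary` subclasses `LogisticRegressionSummary`, existing callers that only use base-summary fields keep working for both branches; this mirrors the isinstance checks added to the unit test.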
[spark] branch master updated (b2300fc -> e10516a)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b2300fc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) add e10516a [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary No new revisions were added by this update. Summary of changes: python/pyspark/ml/classification.py | 5 - python/pyspark/ml/tests/test_training_summary.py | 6 +- 2 files changed, 9 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1ea5844 [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) 1ea5844 is described below commit 1ea584443e9372a6a0b3c8449f5bf7e9e1369b0d Author: Weichen Xu AuthorDate: Thu May 14 09:24:40 2020 -0500 [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0 ``` for (i <- 0 until splits.length) { if (splits(i) == -0.0) { splits(i) = 0.0 } } ``` Fix bug. No Unit test. ~~~scala import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) import spark.implicits._ val df1 = sc.parallelize(a1, 2).toDF("id") import org.apache.spark.ml.feature.QuantileDiscretizer val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0) val model = qd.fit(df1) // will raise error in spark master. ~~~ In Scala, `0.0 == -0.0` is true but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code. array.distinct relies on elem.hashCode, which leads to this error. Test code on distinct ``` import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) a1.distinct.sorted.foreach(x => print(x.toString + "\n")) ``` Then you will see output like: ``` ... -0.009292684662246975 -0.0033280686465135823 -0.0 0.0 0.0022219556032221366 0.02217419561977274 ... 
``` Closes #28498 from WeichenXu123/SPARK-31676. Authored-by: Weichen Xu Signed-off-by: Sean Owen (cherry picked from commit b2300fca1e1a22d74c6eeda37942920a6c6299ff) Signed-off-by: Sean Owen --- .../apache/spark/ml/feature/QuantileDiscretizer.scala | 12 .../spark/ml/feature/QuantileDiscretizerSuite.scala| 18 ++ 2 files changed, 30 insertions(+) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 56e2c54..f3ec358 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -243,6 +243,18 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui private def getDistinctSplits(splits: Array[Double]): Array[Double] = { splits(0) = Double.NegativeInfinity splits(splits.length - 1) = Double.PositiveInfinity + +// 0.0 and -0.0 are distinct values, array.distinct will preserve both of them. +// but 0.0 > -0.0 is False which will break the parameter validation checking. +// and in scala <= 2.12, there's bug which will cause array.distinct generate +// non-deterministic results when array contains both 0.0 and -0.0 +// So that here we should first normalize all 0.0 and -0.0 to be 0.0 +// See https://github.com/scala/bug/issues/11995 +for (i <- 0 until splits.length) { + if (splits(i) == -0.0) { +splits(i) = 0.0 + } +} val distinctSplits = splits.distinct if (splits.length != distinctSplits.length) { log.warn(s"Some quantiles were identical. 
Bucketing to ${distinctSplits.length - 1}" + diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala index b009038..9c37416 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala @@ -443,4 +443,22 @@ class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest { discretizer.fit(df) } } + + test("[SPARK-31676] QuantileDiscretizer raise error parameter splits given invalid value") { +import scala.util.Random +val rng = new Random(3) + +val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++ + Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) + +val
[spark] branch branch-3.0 updated: [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 00e6acc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) 00e6acc is described below commit 00e6acc9c6d45c5dd3b3f70c87909743a8073dba Author: Weichen Xu AuthorDate: Thu May 14 09:24:40 2020 -0500 [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) ### What changes were proposed in this pull request? In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0 ``` for (i <- 0 until splits.length) { if (splits(i) == -0.0) { splits(i) = 0.0 } } ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Manual test: ~~~scala import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) import spark.implicits._ val df1 = sc.parallelize(a1, 2).toDF("id") import org.apache.spark.ml.feature.QuantileDiscretizer val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0) val model = qd.fit(df1) // will raise error in spark master. ~~~ ### Explanation In Scala, `0.0 == -0.0` is true but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code. array.distinct relies on elem.hashCode, which leads to this error. 
Test code on distinct ``` import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) a1.distinct.sorted.foreach(x => print(x.toString + "\n")) ``` Then you will see output like: ``` ... -0.009292684662246975 -0.0033280686465135823 -0.0 0.0 0.0022219556032221366 0.02217419561977274 ... ``` Closes #28498 from WeichenXu123/SPARK-31676. Authored-by: Weichen Xu Signed-off-by: Sean Owen (cherry picked from commit b2300fca1e1a22d74c6eeda37942920a6c6299ff) Signed-off-by: Sean Owen --- .../apache/spark/ml/feature/QuantileDiscretizer.scala | 12 .../spark/ml/feature/QuantileDiscretizerSuite.scala| 18 ++ 2 files changed, 30 insertions(+) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 216d99d..4eedfc4 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -236,6 +236,18 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui private def getDistinctSplits(splits: Array[Double]): Array[Double] = { splits(0) = Double.NegativeInfinity splits(splits.length - 1) = Double.PositiveInfinity + +// 0.0 and -0.0 are distinct values, array.distinct will preserve both of them. +// but 0.0 > -0.0 is False which will break the parameter validation checking. +// and in scala <= 2.12, there's bug which will cause array.distinct generate +// non-deterministic results when array contains both 0.0 and -0.0 +// So that here we should first normalize all 0.0 and -0.0 to be 0.0 +// See https://github.com/scala/bug/issues/11995 +for (i <- 0 until splits.length) { + if (splits(i) == -0.0) { +splits(i) = 0.0 + } +} val distinctSplits = splits.distinct if (splits.length != distinctSplits.length) { log.warn(s"Some quantiles were identical. 
Bucketing to ${distinctSplits.length - 1}" + diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala index 6f6ab26..682b87a 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala @@ -512,4 +512,22 @@ class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest { assert(observedNumBuckets === numBuckets, "Observed number of buckets does not equal expected number of
[spark] branch master updated (ddbce4e -> b2300fc)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ddbce4e [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … add b2300fc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) No new revisions were added by this update. Summary of changes: .../apache/spark/ml/feature/QuantileDiscretizer.scala | 12 .../spark/ml/feature/QuantileDiscretizerSuite.scala| 18 ++ 2 files changed, 30 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
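The normalization in b2300fc is easy to reason about outside the JVM. The following sketch (pure Python, illustrative names — Spark's implementation is the Scala shown above) reproduces the shape of `getDistinctSplits` after the fix: pin the two endpoints to ±infinity, normalize every -0.0 to 0.0, then deduplicate, so a -0.0/0.0 pair can no longer survive `distinct`. Note Python's own `hash(-0.0)` equals `hash(0.0)`, so only the normalization step, not the JVM bug itself, can be reproduced here.

```python
import math


def get_distinct_splits(splits):
    """Python sketch of QuantileDiscretizer.getDistinctSplits after the fix."""
    splits = list(splits)
    splits[0] = float("-inf")
    splits[-1] = float("inf")
    # Normalize -0.0 to 0.0 first; on the JVM 0.0 == -0.0 yet their
    # hashCodes differ, which is what made array.distinct misbehave.
    splits = [0.0 if (s == 0.0 and math.copysign(1.0, s) < 0) else s
              for s in splits]
    out, seen = [], set()
    for s in splits:  # order-preserving distinct
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out
```

With an input containing both zeros, e.g. `[-0.0, -0.5, -0.0, 0.0, 0.5, 0.0]`, the result keeps a single positive `0.0`.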
[spark] branch master updated: [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 59d9099 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization 59d9099 is described below commit 59d90997a52f78450fefbc96beba1d731b3678a1 Author: Antonin Delpeuch AuthorDate: Tue May 12 08:27:43 2020 -0500 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization ### What changes were proposed in this pull request? This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)). I added two sentences about this in the RDD Programming Guide. The issue was discussed on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html ### Why are the changes needed? Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order. This is unfortunately not the case at the moment and there is no agreed-upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this. ### Does this PR introduce _any_ user-facing change? Yes, two new sentences in the documentation. ### How was this patch tested? By checking that the documentation looks good. Closes #28465 from wetneb/SPARK-5300-docs. 
Authored-by: Antonin Delpeuch Signed-off-by: Sean Owen --- docs/rdd-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md index ba99007..70bfefc 100644 --- a/docs/rdd-programming-guide.md +++ b/docs/rdd-programming-guide.md @@ -360,7 +360,7 @@ Some notes on reading files with Spark: * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system. -* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. +* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partiti [...] * The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
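A practical consequence of the documented caveat, sketched below in plain Python (no Spark required; the `part-*` file layout is a made-up stand-in for `saveAsTextFile` output): filesystem listing order carries no guarantee, so pipelines that depend on order should sort the part-file paths (or the records themselves) explicitly before consuming them.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Simulate saveAsTextFile output: one part file per partition,
    # created here in a deliberately scrambled order.
    for name in ["part-00002", "part-00000", "part-00001"]:
        with open(os.path.join(d, name), "w") as f:
            f.write(name + "\n")
    # os.listdir gives no ordering guarantee; sort explicitly when
    # downstream logic depends on partition order.
    parts = sorted(os.listdir(d))
    print(parts)  # ['part-00000', 'part-00001', 'part-00002']
```

Sorting paths only restores a deterministic file order; within-partition order still depends on how the data was written, which is exactly what the new doc sentences warn about.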
[spark] branch branch-2.4 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1f85cd7 [SPARK-31671][ML] Wrong error message in VectorAssembler 1f85cd7 is described below commit 1f85cd7504623b9b4e7957aab5856f72e981cbd9 Author: fan31415 AuthorDate: Mon May 11 18:23:23 2020 -0500 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However, the error message attached to this exception is inaccurate. I changed its content so it names the right columns. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduces difficulties when resolving the exception, because it is unclear which column requires a VectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? Add test in VectorAssemblerSuite. Closes #28487 from fan31415/SPARK-31671. Lead-authored-by: fan31415 Co-authored-by: yijiefan Signed-off-by: Sean Owen (cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853) Signed-off-by: Sean Owen --- .../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +- .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala index 9192e72..994681a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala @@ -228,7 +228,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] { getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns) case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException( s"""Can not infer column lengths with handleInvalid = "keep". 
Consider using VectorSizeHint - |to add metadata for columns: ${columns.mkString("[", ", ", "]")}.""" + |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}.""" .stripMargin.replaceAll("\n", " ")) case (_, _) => Map.empty } diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala index a4d388f..4957f6f 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala @@ -261,4 +261,15 @@ class VectorAssemblerSuite val output = vectorAssembler.transform(dfWithNullsAndNaNs) assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty))) } + + test("SPARK-31671: should give explicit error message when can not infer column lengths") { +val df = Seq( + (Vectors.dense(1.0), Vectors.dense(2.0)) +).toDF("n1", "n2") +val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df) +val assembler = new VectorAssembler() + .setInputCols(Array("n1", "n2")).setOutputCol(&quo
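The one-line fix (`columns` → `missingColumns`) is easier to see as a standalone sketch. This is pure Python with illustrative names — Spark's actual code is the Scala diff above — showing that the error should be built only from the columns whose vector length could not be inferred:

```python
def length_inference_error(input_cols, known_lengths):
    """Build a VectorAssembler-style error naming only the columns whose
    vector length could not be inferred (the corrected behaviour)."""
    missing = [c for c in input_cols if c not in known_lengths]
    return ('Can not infer column lengths with handleInvalid = "keep". '
            'Consider using VectorSizeHint to add metadata for columns: '
            '[' + ', '.join(missing) + '].')
```

With only `n1` hinted (`known_lengths = {"n1": 1}`), the message now names just `n2`, matching the expected output in the bug report.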
[spark] branch branch-3.0 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 7e226a2 [SPARK-31671][ML] Wrong error message in VectorAssembler 7e226a2 is described below commit 7e226a25efeaf083c95f04ee0d9c3a6e5b6d763d Author: fan31415 AuthorDate: Mon May 11 18:23:23 2020 -0500 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However the error message with this exception is not consistent. I change the content of this error message to make it work properly. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduce difficulties when I try to resolve this exception, for I do not know which column required vectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? Add test in VectorAssemblerSuite. Closes #28487 from fan31415/SPARK-31671. Lead-authored-by: fan31415 Co-authored-by: yijiefan Signed-off-by: Sean Owen (cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853) Signed-off-by: Sean Owen --- .../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +- .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala index 3070012..7bc5e56 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala @@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] { getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns) case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException( s"""Can not infer column lengths with handleInvalid = "keep". 
Consider using VectorSizeHint - |to add metadata for columns: ${columns.mkString("[", ", ", "]")}.""" + |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}.""" .stripMargin.replaceAll("\n", " ")) case (_, _) => Map.empty } diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala index a4d388f..4957f6f 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala @@ -261,4 +261,15 @@ class VectorAssemblerSuite val output = vectorAssembler.transform(dfWithNullsAndNaNs) assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty))) } + + test("SPARK-31671: should give explicit error message when can not infer column lengths") { +val df = Seq( + (Vectors.dense(1.0), Vectors.dense(2.0)) +).toDF("n1", "n2") +val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df) +val assembler = new VectorAssembler() + .setInputCols(Array("n1", "n2")).setOutputCol(&quo
[spark] branch branch-2.4 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1f85cd7 [SPARK-31671][ML] Wrong error message in VectorAssembler 1f85cd7 is described below commit 1f85cd7504623b9b4e7957aab5856f72e981cbd9 Author: fan31415 AuthorDate: Mon May 11 18:23:23 2020 -0500 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However the error message with this exception is not consistent. I change the content of this error message to make it work properly. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduce difficulties when I try to resolve this exception, for I do not know which column required vectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415
Co-authored-by: yijiefan
Signed-off-by: Sean Owen
(cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 9192e72..994681a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -228,7 +228,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 7e226a2  [SPARK-31671][ML] Wrong error message in VectorAssembler

7e226a2 is described below

commit 7e226a25efeaf083c95f04ee0d9c3a6e5b6d763d
Author: fan31415
AuthorDate: Mon May 11 18:23:23 2020 -0500

    [SPARK-31671][ML] Wrong error message in VectorAssembler

### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message accompanying this exception is inaccurate: it lists every input column instead of only the ones whose lengths are unknown. This change corrects the content of the error message.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
// create a df without vector size
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has a vector size, the error message should tell us to set a vector size for n2 only
output.show()
```

Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This introduces difficulties when trying to resolve the exception, since the message does not say which column actually requires a VectorSizeHint. This is especially troublesome when there is a large number of columns to deal with.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415
Co-authored-by: yijiefan
Signed-off-by: Sean Owen
(cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 3070012..7bc5e56 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (d7c3e9e -> 64fb358)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d7c3e9e [SPARK-31456][CORE] Fix shutdown hook priority edge cases add 64fb358 [SPARK-31671][ML] Wrong error message in VectorAssembler No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +- .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 64fb358  [SPARK-31671][ML] Wrong error message in VectorAssembler

64fb358 is described below

commit 64fb358a994d3fff651a742fa067c194b7455853
Author: fan31415
AuthorDate: Mon May 11 18:23:23 2020 -0500

    [SPARK-31671][ML] Wrong error message in VectorAssembler

### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message accompanying this exception is inaccurate: it lists every input column instead of only the ones whose lengths are unknown. This change corrects the content of the error message.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
// create a df without vector size
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has a vector size, the error message should tell us to set a vector size for n2 only
output.show()
```

Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This introduces difficulties when trying to resolve the exception, since the message does not say which column actually requires a VectorSizeHint. This is especially troublesome when there is a large number of columns to deal with.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415
Co-authored-by: yijiefan
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 3070012..7bc5e56 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
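[Editor's note] The one-line fix swaps `columns` for `missingColumns` in the interpolated message, so only the columns that actually lack size metadata are reported. As a rough illustration of the behaviour the new test checks, here is a small hypothetical Java sketch of the corrected message construction; the `keepInvalidMessage` helper is illustrative only (Spark's actual code is the Scala `mkString` call shown in the diff):

```java
import java.util.List;
import java.util.stream.Collectors;

public class Main {
    // Hypothetical helper mirroring the corrected interpolation: the message
    // names only the columns whose lengths could not be inferred.
    static String keepInvalidMessage(List<String> missingColumns) {
        return "Can not infer column lengths with handleInvalid = \"keep\". "
            + "Consider using VectorSizeHint to add metadata for columns: "
            + missingColumns.stream().collect(Collectors.joining(", ", "[", "]"))
            + ".";
    }

    public static void main(String[] args) {
        // With inputs n1 (sized via VectorSizeHint) and n2 (unsized), only n2
        // should appear in the message — before the fix, both did.
        System.out.println(keepInvalidMessage(List.of("n2")));
    }
}
```

This is exactly what the added suite test asserts: the exception message must not contain "n1", the column that already has a size hint.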
[spark] branch master updated (32a5398 -> 7a670b5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 32a5398 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps add 7a670b5 [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest No new revisions were added by this update. Summary of changes: python/pyspark/ml/stat.py | 60 +-- 1 file changed, 48 insertions(+), 12 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (7a670b5 -> 5a5af46)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7a670b5 [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest add 5a5af46 [SPARK-31575][SQL] Synchronise global JVM security configuration modification No new revisions were added by this update. Summary of changes: .../jdbc/connection/DB2ConnectionProvider.scala| 2 +- .../connection/MariaDBConnectionProvider.scala | 2 +- .../connection/PostgresConnectionProvider.scala| 2 +- .../jdbc/connection/SecureConnectionProvider.scala | 9 - .../jdbc/connection/ConnectionProviderSuite.scala | 45 ++ 5 files changed, 56 insertions(+), 4 deletions(-) create mode 100644 sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProviderSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
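[Editor's note] The change above synchronises modification of JVM-global security configuration: the JDBC secure connection providers mutate shared JAAS state (`javax.security.auth.login.Configuration`), so unsynchronised concurrent updates can clobber each other. The sketch below illustrates the general pattern of serialising such updates behind a shared lock; the `installEntry` helper and lock object are hypothetical and only approximate the idea, not Spark's actual `SecureConnectionProvider` code:

```java
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

public class Main {
    // JAAS Configuration is process-wide mutable state; a shared lock makes
    // the read-modify-write of the global configuration atomic.
    private static final Object CONFIG_LOCK = new Object();

    static void installEntry(final String appName, final AppConfigurationEntry entry) {
        synchronized (CONFIG_LOCK) {
            Configuration.setConfiguration(new Configuration() {
                @Override
                public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                    // Serve our entry for the registered application name only.
                    return appName.equals(name)
                        ? new AppConfigurationEntry[] { entry }
                        : null;
                }
            });
        }
    }

    public static void main(String[] args) {
        AppConfigurationEntry entry = new AppConfigurationEntry(
            "com.sun.security.auth.module.Krb5LoginModule",
            AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
            Map.of());
        installEntry("example-app", entry);
        System.out.println(Configuration.getConfiguration()
            .getAppConfigurationEntry("example-app").length); // prints 1
    }
}
```

Without the lock, two providers installing entries concurrently could each capture the same base configuration and one update would be lost.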
[spark] branch master updated (a75dc80 -> 9f768fa)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a75dc80 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference add 9f768fa [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/RandomDataGenerator.scala | 23 +++--- 1 file changed, 20 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 6f7c719  [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps

6f7c719 is described below

commit 6f7c71947073f147bc35da196139d5ceb6fbdf45
Author: Max Gekk
AuthorDate: Sun May 10 14:22:12 2020 -0500

    [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps

### What changes were proposed in this pull request?
Shift dates that do not exist in the Proleptic Gregorian calendar by 1 day. The reason for that is that `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some of them don't exist in the Proleptic Gregorian calendar — for example 1000-02-29, because 1000 is not a leap year in the Proleptic Gregorian calendar.

### Why are the changes needed?
This makes RowEncoderSuite much more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running RowEncoderSuite and setting a non-existing date manually:
```scala
val date = new java.sql.Date(1000 - 1900, 1, 29)
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
```

Closes #28486 from MaxGekk/fix-RowEncoderSuite.
Authored-by: Max Gekk
Signed-off-by: Sean Owen
(cherry picked from commit 9f768fa9916dec3cc695e3f28ec77148d81d335f)
Signed-off-by: Sean Owen
---
 .../org/apache/spark/sql/RandomDataGenerator.scala | 23 +++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
index a7c20c3..5a4d23d 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
@@ -18,9 +18,10 @@
 package org.apache.spark.sql
 
 import java.math.MathContext
+import java.sql.{Date, Timestamp}
 
 import scala.collection.mutable
-import scala.util.Random
+import scala.util.{Random, Try}
 
 import org.apache.spark.sql.catalyst.CatalystTypeConverters
 import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY
@@ -172,7 +173,15 @@ object RandomDataGenerator {
           // January 1, 1970, 00:00:00 GMT for "9999-12-31 23:59:59.999999".
           milliseconds = rand.nextLong() % 253402329599999L
         }
-        DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt)
+        val date = DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt)
+        // The generated `date` is based on the hybrid calendar Julian + Gregorian since
+        // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used
+        // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `date` to
+        // a local date in Proleptic Gregorian calendar to satisfy this requirement.
+        // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar.
+        // As the consequence of that, 29 February of such years might not exist in Proleptic
+        // Gregorian calendar. When this happens, we shift the date by one day.
+        Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
       }
       Some(generator)
     case TimestampType =>
@@ -188,7 +197,15 @@ object RandomDataGenerator {
           milliseconds = rand.nextLong() % 253402329599999L
         }
         // DateTimeUtils.toJavaTimestamp takes microsecond.
-        DateTimeUtils.toJavaTimestamp(milliseconds * 1000)
+        val ts = DateTimeUtils.toJavaTimestamp(milliseconds * 1000)
+        // The generated `ts` is based on the hybrid calendar Julian + Gregorian since
+        // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used
+        // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `ts` to
+        // a local timestamp in Proleptic Gregorian calendar to satisfy this requirement.
+        // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar.
+        // As the consequence of that, 29 February of such years might not exist in Proleptic
+        // Gregorian calendar. When this happens, we shift the timestamp `ts` by one day.
+        Try { ts.toLocalDateTime; ts }.getOrElse(new Timestamp(ts.getTime + MILLIS_PER_DAY))
       }
       Some(generator)
     case CalendarIntervalType => Some(() => {
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
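[Editor's note] The shift-by-one-day trick works because `java.sql.Date` still uses the hybrid Julian + Gregorian calendar while `java.time.LocalDate` is Proleptic Gregorian, so a hybrid-calendar day like 1000-02-29 has no `LocalDate` counterpart. A small standalone Java demonstration of the same fallback pattern (assuming default JDK calendar behaviour; this is a sketch, not Spark code):

```java
import java.sql.Date;
import java.time.DateTimeException;

public class Main {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    public static void main(String[] args) {
        // The deprecated constructor interprets its arguments in the hybrid
        // Julian + Gregorian calendar, where year 1000 IS a leap year, so
        // 1000-02-29 is representable.
        @SuppressWarnings("deprecation")
        Date date = new Date(1000 - 1900, 1, 29);

        // LocalDate uses the Proleptic Gregorian calendar, where year 1000 is
        // NOT a leap year, so the conversion fails with DateTimeException.
        Date valid;
        try {
            date.toLocalDate();
            valid = date;
        } catch (DateTimeException e) {
            // Shift by one day, as the patch does, to land on a day that
            // exists in both calendars.
            valid = new Date(date.getTime() + MILLIS_PER_DAY);
        }
        System.out.println(valid.toLocalDate()); // prints 1000-03-01
    }
}
```

This mirrors the `Try { date.toLocalDate; date }.getOrElse(...)` fallback in the patch: a failed proleptic-Gregorian conversion is the signal to nudge the value forward one day.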
[spark] branch master updated (ce63bef -> a75dc80)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ce63bef [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns add a75dc80 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference No new revisions were added by this update. Summary of changes: docs/_data/menu-sql.yaml | 20 +- docs/sql-ref-ansi-compliance.md| 18 +- docs/sql-ref-datatypes.md | 4 +- docs/sql-ref-functions-builtin.md | 2 +- docs/sql-ref-functions-udf-aggregate.md| 101 docs/sql-ref-functions-udf-hive.md | 12 +- docs/sql-ref-functions-udf-scalar.md | 28 +- docs/sql-ref-identifier.md | 37 ++- docs/sql-ref-literals.md | 282 + docs/sql-ref-null-semantics.md | 44 ++-- docs/sql-ref-syntax-aux-analyze-table.md | 64 ++--- docs/sql-ref-syntax-aux-cache-cache-table.md | 98 +++ docs/sql-ref-syntax-aux-cache-clear-cache.md | 16 +- docs/sql-ref-syntax-aux-cache-refresh.md | 24 +- docs/sql-ref-syntax-aux-cache-uncache-table.md | 31 +-- docs/sql-ref-syntax-aux-conf-mgmt-reset.md | 10 +- docs/sql-ref-syntax-aux-conf-mgmt-set.md | 31 +-- docs/sql-ref-syntax-aux-describe-database.md | 21 +- docs/sql-ref-syntax-aux-describe-function.md | 30 +-- docs/sql-ref-syntax-aux-describe-query.md | 44 ++-- docs/sql-ref-syntax-aux-describe-table.md | 62 ++--- docs/sql-ref-syntax-aux-refresh-table.md | 31 +-- docs/sql-ref-syntax-aux-resource-mgmt-add-file.md | 21 +- docs/sql-ref-syntax-aux-resource-mgmt-add-jar.md | 21 +- docs/sql-ref-syntax-aux-resource-mgmt-list-file.md | 14 +- docs/sql-ref-syntax-aux-resource-mgmt-list-jar.md | 14 +- docs/sql-ref-syntax-aux-show-columns.md| 2 +- docs/sql-ref-syntax-aux-show-create-table.md | 27 +- docs/sql-ref-syntax-aux-show-databases.md | 32 +-- docs/sql-ref-syntax-aux-show-functions.md | 60 ++--- docs/sql-ref-syntax-aux-show-partitions.md | 47 ++-- docs/sql-ref-syntax-aux-show-table.md | 60 ++--- 
docs/sql-ref-syntax-aux-show-tables.md | 41 ++- docs/sql-ref-syntax-aux-show-tblproperties.md | 51 ++-- docs/sql-ref-syntax-aux-show-views.md | 45 ++-- docs/sql-ref-syntax-aux-show.md| 4 +- docs/sql-ref-syntax-ddl-alter-database.md | 17 +- docs/sql-ref-syntax-ddl-alter-table.md | 256 --- docs/sql-ref-syntax-ddl-alter-view.md | 124 - docs/sql-ref-syntax-ddl-create-database.md | 39 +-- docs/sql-ref-syntax-ddl-create-function.md | 85 +++ docs/sql-ref-syntax-ddl-create-table-datasource.md | 100 docs/sql-ref-syntax-ddl-create-table-hiveformat.md | 99 docs/sql-ref-syntax-ddl-create-table-like.md | 73 +++--- docs/sql-ref-syntax-ddl-create-table.md| 10 +- docs/sql-ref-syntax-ddl-create-view.md | 82 +++--- docs/sql-ref-syntax-ddl-drop-database.md | 42 ++- docs/sql-ref-syntax-ddl-drop-function.md | 55 ++-- docs/sql-ref-syntax-ddl-drop-table.md | 45 ++-- docs/sql-ref-syntax-ddl-drop-view.md | 49 ++-- docs/sql-ref-syntax-ddl-repair-table.md| 25 +- docs/sql-ref-syntax-ddl-truncate-table.md | 43 ++-- docs/sql-ref-syntax-dml-insert-into.md | 90 +++ ...f-syntax-dml-insert-overwrite-directory-hive.md | 75 +++--- ...ql-ref-syntax-dml-insert-overwrite-directory.md | 74 +++--- docs/sql-ref-syntax-dml-insert-overwrite-table.md | 87 +++ docs/sql-ref-syntax-dml-insert.md | 8 +- docs/sql-ref-syntax-dml-load.md| 67 ++--- docs/sql-ref-syntax-dml.md | 4 +- docs/sql-ref-syntax-qry-explain.md | 58 ++--- docs/sql-ref-syntax-qry-sampling.md| 20 +- docs/sql-ref-syntax-qry-select-clusterby.md| 33 ++- docs/sql-ref-syntax-qry-select-cte.md | 35 ++- docs/sql-ref-syntax-qry-select-distribute-by.md| 33 ++- docs/sql-ref-syntax-qry-select-groupby.md | 261 ++- docs/sql-ref-syntax-qry-select-having.md | 54 ++-- docs/sql-ref-syntax-qry-select-hints.md| 56 ++-- docs/sql-ref-syntax-qry-select-inline-table.md | 35 +-- docs/sql-ref-syntax-qry-select-join.md | 185 ++ docs/sql-ref-syntax-qry-select-like.md | 51 ++-- docs/sql-ref-syntax-qry-select-limit.md| 41 ++- docs/sql-ref-syntax-qry-select-orderby.md
[spark] branch master updated (b16ea8e -> 09ece50)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b16ea8e [SPARK-31650][SQL] Fix wrong UI in case of AdaptiveSparkPlanExec has unmanaged subqueries add 09ece50 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark No new revisions were added by this update. Summary of changes: python/pyspark/ml/feature.py | 142 +++ 1 file changed, 142 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 09ece50 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark 09ece50 is described below commit 09ece50799222d577009a2bbd480304d1ae1e14e Author: Huaxin Gao AuthorDate: Wed May 6 09:11:03 2020 -0500 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark ### What changes were proposed in this pull request? Add VarianceThresholdSelector to PySpark ### Why are the changes needed? parity between Scala and Python ### Does this PR introduce any user-facing change? Yes. VarianceThresholdSelector is added to PySpark ### How was this patch tested? new doctest Closes #28409 from huaxingao/variance_py. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- python/pyspark/ml/feature.py | 142 +++ 1 file changed, 142 insertions(+) diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 6df2f74..7acf8ce 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -57,6 +57,7 @@ __all__ = ['Binarizer', 'StopWordsRemover', 'StringIndexer', 'StringIndexerModel', 'Tokenizer', + 'VarianceThresholdSelector', 'VarianceThresholdSelectorModel', 'VectorAssembler', 'VectorIndexer', 'VectorIndexerModel', 'VectorSizeHint', @@ -5381,6 +5382,147 @@ class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReada return self._set(handleInvalid=value) +class _VarianceThresholdSelectorParams(HasFeaturesCol, HasOutputCol): +""" +Params for :py:class:`VarianceThresholdSelector` and +:py:class:`VarianceThresholdSelectorrModel`. + +.. versionadded:: 3.1.0 +""" + +varianceThreshold = Param(Params._dummy(), "varianceThreshold", + "Param for variance threshold. Features with a variance not " + + "greater than this threshold will be removed. 
The default value " + + "is 0.0.", typeConverter=TypeConverters.toFloat) + +@since("3.1.0") +def getVarianceThreshold(self): +""" +Gets the value of varianceThreshold or its default value. +""" +return self.getOrDefault(self.varianceThreshold) + + +@inherit_doc +class VarianceThresholdSelector(JavaEstimator, _VarianceThresholdSelectorParams, JavaMLReadable, +JavaMLWritable): +""" +Feature selector that removes all low-variance features. Features with a +variance not greater than the threshold will be removed. The default is to keep +all features with non-zero variance, i.e. remove the features that have the +same value in all samples. + +>>> from pyspark.ml.linalg import Vectors +>>> df = spark.createDataFrame( +...[(Vectors.dense([6.0, 7.0, 0.0, 7.0, 6.0, 0.0]),), +... (Vectors.dense([0.0, 9.0, 6.0, 0.0, 5.0, 9.0]),), +... (Vectors.dense([0.0, 9.0, 3.0, 0.0, 5.0, 5.0]),), +... (Vectors.dense([0.0, 9.0, 8.0, 5.0, 6.0, 4.0]),), +... (Vectors.dense([8.0, 9.0, 6.0, 5.0, 4.0, 4.0]),), +... (Vectors.dense([8.0, 9.0, 6.0, 0.0, 0.0, 0.0]),)], +...["features"]) +>>> selector = VarianceThresholdSelector(varianceThreshold=8.2, outputCol="selectedFeatures") +>>> model = selector.fit(df) +>>> model.getFeaturesCol() +'features' +>>> model.setFeaturesCol("features") +VarianceThresholdSelectorModel... +>>> model.transform(df).head().selectedFeatures +DenseVector([6.0, 7.0, 0.0]) +>>> model.selectedFeatures +[0, 3, 5] +>>> varianceThresholdSelectorPath = temp_path + "/variance-threshold-selector" +>>> selector.save(varianceThresholdSelectorPath) +>>> loadedSelector = VarianceThresholdSelector.load(varianceThresholdSelectorPath) +>>> loadedSelector.getVarianceThreshold() == selector.getVarianceThreshold() +True +>>> modelPath = temp_path + "/variance-threshold-selector-model" +>>> model.save(modelPath) +>>> loadedModel = VarianceThresholdSelectorModel.load(modelPath) +>>> loadedModel.selectedFeatures == model.selectedFeatures +True + +.. 
versionadded:: 3.1.0 +""" + +@keyword_only +def __init__(self, featuresCol="features", outputCol=None, varianceThreshold=0.0): +
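The selection rule in the doctest above can be reproduced without Spark. Below is a minimal pure-Python sketch of the same logic — per-column variance, keep columns whose variance exceeds the threshold — using the data and threshold from the doctest (the helper name is ours, not part of the PySpark API; for this data the biased and unbiased variance estimators select the same columns):

```python
# Minimal sketch of VarianceThresholdSelector's selection rule (no Spark needed).
# Columns whose variance is not greater than the threshold are dropped.

def select_by_variance(rows, threshold):
    """Return indices of feature columns whose sample variance exceeds `threshold`."""
    n = len(rows)
    cols = list(zip(*rows))  # transpose: one tuple per feature column
    kept = []
    for j, col in enumerate(cols):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)  # unbiased (sample) variance
        if var > threshold:
            kept.append(j)
    return kept

data = [
    [6.0, 7.0, 0.0, 7.0, 6.0, 0.0],
    [0.0, 9.0, 6.0, 0.0, 5.0, 9.0],
    [0.0, 9.0, 3.0, 0.0, 5.0, 5.0],
    [0.0, 9.0, 8.0, 5.0, 6.0, 4.0],
    [8.0, 9.0, 6.0, 5.0, 4.0, 4.0],
    [8.0, 9.0, 6.0, 0.0, 0.0, 0.0],
]
print(select_by_variance(data, 8.2))  # matches model.selectedFeatures: [0, 3, 5]
```

With the doctest's threshold of 8.2, columns 1, 2, and 4 have variance at or below the threshold and are dropped, leaving indices [0, 3, 5] — the same `selectedFeatures` the doctest asserts.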
[spark] branch master updated (5052d95 -> 701deac)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 5052d95 [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table add 701deac [SPARK-31603][ML] AFT uses common functions in RDDLossFunction No new revisions were added by this update. Summary of changes: .../spark/ml/optim/aggregator/AFTAggregator.scala | 162 +++ .../aggregator/DifferentiableLossAggregator.scala | 9 +- .../ml/regression/AFTSurvivalRegression.scala | 228 + 3 files changed, 173 insertions(+), 226 deletions(-) create mode 100644 mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31603][ML] AFT uses common functions in RDDLossFunction
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 701deac [SPARK-31603][ML] AFT uses common functions in RDDLossFunction 701deac is described below commit 701deac88d09690ddf9d28b9c79814aecfd3277d Author: zhengruifeng AuthorDate: Tue May 5 08:35:20 2020 -0500 [SPARK-31603][ML] AFT uses common functions in RDDLossFunction ### What changes were proposed in this pull request? 1, make AFT reuse common functions in `ml.optim`, rather than making its own impl. ### Why are the changes needed? The logic in optimizing AFT is quite similar to other algorithms like other algs based on `RDDLossFunction`, We should reuse the common functions. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #28404 from zhengruifeng/mv_aft_optim. Authored-by: zhengruifeng Signed-off-by: Sean Owen --- .../spark/ml/optim/aggregator/AFTAggregator.scala | 162 +++ .../aggregator/DifferentiableLossAggregator.scala | 9 +- .../ml/regression/AFTSurvivalRegression.scala | 228 + 3 files changed, 173 insertions(+), 226 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala new file mode 100644 index 000..6482c61 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.optim.aggregator + +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.regression.AFTPoint + +/** + * AFTAggregator computes the gradient and loss for a AFT loss function, + * as used in AFT survival regression for samples in sparse or dense vector in an online fashion. + * + * The loss function and likelihood function under the AFT model based on: + * Lawless, J. F., Statistical Models and Methods for Lifetime Data, + * New York: John Wiley & Sons, Inc. 2003. + * + * Two AFTAggregator can be merged together to have a summary of loss and gradient of + * the corresponding joint dataset. + * + * Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of subjects i = 1,..,n, + * with possible right-censoring, the likelihood function under the AFT model is given as + * + * + *$$ + *L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0} + * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0} + *(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}} + *$$ + * + * + * Where $\delta_{i}$ is the indicator of the event has occurred i.e. uncensored or not. 
+ * Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function + * assumes the form + * + * + *$$ + *\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+ + * \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}] + *$$ + * + * Where $S_{0}(\epsilon_{i})$ is the baseline survivor function, + * and $f_{0}(\epsilon_{i})$ is corresponding density function. + * + * The most commonly used log-linear survival regression method is based on the Weibull + * distribution of the survival time. The Weibull distribution for lifetime corresponding + * to extreme value distribution for log of the lifetime, + * and the $S_{0}(\epsilon)$ function is + * + * + *$$ + *S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}}) + *$$ + * + * + * and the $f_{0}(\epsilon_{i})$ function is + * + * + *$$ + *f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}}) + *$$ + * + * + * The log-likelihood function for Weibull distribution of lifetime is + * + * + *$$ + *\iota(\beta,\sigma)= + * -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}] + *$$ + * + * + * Due to minimizing the negative l
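The Weibull negative log-likelihood that the aggregator minimizes can be evaluated directly from the formulas in the scaladoc above. The sketch below is our own illustration (not the AFTAggregator code): it computes $-\iota(\beta,\sigma)=\sum_i[\delta_i\log\sigma-\delta_i\epsilon_i+e^{\epsilon_i}]$ with $\epsilon_i=(\log t_i - x_i'\beta)/\sigma$, parameterized by $\log\sigma$ as the commit's optimizer does so that $\sigma$ stays positive:

```python
import math

def aft_weibull_neg_log_likelihood(data, beta, intercept, log_sigma):
    """Negative log-likelihood of the Weibull AFT model.

    data: list of (t, delta, x) with lifetime t > 0, censoring indicator
    delta (1.0 = event observed, 0.0 = right-censored) and feature list x.
    """
    sigma = math.exp(log_sigma)  # optimizing log(sigma) keeps sigma > 0
    total = 0.0
    for t, delta, x in data:
        margin = intercept + sum(b * v for b, v in zip(beta, x))
        eps = (math.log(t) - margin) / sigma
        # -iota = sum_i [ delta_i*log(sigma) - delta_i*eps_i + exp(eps_i) ]
        total += delta * log_sigma - delta * eps + math.exp(eps)
    return total

# Tiny example: two subjects, the second one right-censored.
data = [(1.2, 1.0, [1.5, -0.3]), (4.0, 0.0, [0.5, 0.9])]
loss = aft_weibull_neg_log_likelihood(data, beta=[0.1, -0.2], intercept=0.3, log_sigma=0.0)
```

For a censored subject ($\delta=0$) only the survivor term $e^{\epsilon}$ contributes, exactly as $\log S_0(\epsilon)=-e^{\epsilon}$ in the derivation above.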
[spark] branch master updated: [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 348fd53 [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue 348fd53 is described below commit 348fd53214ccc476bee37e3ddd6b075a53886104 Author: Qianyang Yu AuthorDate: Fri May 1 09:16:08 2020 -0500 [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue ### What changes were proposed in this pull request? Add FValue example for ml.stat.FValueTest in python/java/scala ### Why are the changes needed? Improve ML example ### Does this PR introduce any user-facing change? No ### How was this patch tested? manually run the example Closes #28400 from kevinyu98/spark-26111-fvalue-examples. Authored-by: Qianyang Yu Signed-off-by: Sean Owen --- .../spark/examples/ml/JavaFValueTestExample.java | 75 ++ examples/src/main/python/ml/fvalue_test_example.py | 52 +++ .../spark/examples/ml/FVlaueTestExample.scala | 63 ++ 3 files changed, 190 insertions(+) diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java new file mode 100644 index 000..11861ac --- /dev/null +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.ml; + +import org.apache.spark.sql.SparkSession; + +// $example on$ +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.ml.linalg.Vectors; +import org.apache.spark.ml.linalg.VectorUDT; +import org.apache.spark.ml.stat.FValueTest; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.types.*; +// $example off$ + +/** + * An example for FValue testing. + * Run with + * + * bin/run-example ml.JavaFValueTestExample + * + */ +public class JavaFValueTestExample { + + public static void main(String[] args) { +SparkSession spark = SparkSession + .builder() + .appName("JavaFValueTestExample") + .getOrCreate(); + +// $example on$ +List data = Arrays.asList( + RowFactory.create(4.6, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)), + RowFactory.create(6.6, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)), + RowFactory.create(5.1, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)), + RowFactory.create(7.6, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), + RowFactory.create(9.0, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), + RowFactory.create(9.0, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)) +); + +StructType schema = new StructType(new StructField[]{ + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("features", new VectorUDT(), false, Metadata.empty()), +}); + +Dataset df = spark.createDataFrame(data, schema); +Row r = FValueTest.test(df, "features", "label").head(); 
+System.out.println("pValues: " + r.get(0).toString()); +System.out.println("degreesOfFreedom: " + r.getList(1).toString()); +System.out.println("fvalue: " + r.get(2).toString()); + +// $example off$ + +spark.stop(); + } +} diff --git a/examples/src/main/python/ml/fvalue_test_example.py b/examples/src/main/python/ml/fvalue_test_example.py new file mode 100644 index 000..4a97bcd --- /dev/null +++ b/examples/src/main/python/ml/fvalue_test_example.py @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not
[spark] branch branch-3.0 updated: [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new f8ff9c5 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started f8ff9c5 is described below commit f8ff9c5eff55ba7003a51f9ac91786d16764f4c9 Author: Huaxin Gao AuthorDate: Tue Apr 28 11:17:45 2020 -0500 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started ### What changes were proposed in this pull request? Add a paragraph for scalar function in sql getting started ### Why are the changes needed? To make 3.0 doc complete. ### Does this PR introduce any user-facing change? before: https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png;> after: https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png;> https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png;> ### How was this patch tested? Closes #28290 from huaxingao/scalar. Authored-by: Huaxin Gao Signed-off-by: Sean Owen (cherry picked from commit dcc09022f1b8ecedf6b64bf35ce5d83500211351) Signed-off-by: Sean Owen --- docs/sql-getting-started.md | 13 + docs/sql-ref-functions.md | 7 +-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md index dab34af..5a6f182 100644 --- a/docs/sql-getting-started.md +++ b/docs/sql-getting-started.md @@ -347,16 +347,13 @@ For example: ## Scalar Functions -(to be filled soon) -## Aggregations +Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). 
It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html). -The [built-in DataFrames functions](api/scala/org/apache/spark/sql/functions$.html) provide common -aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. -While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in -[Scala](api/scala/org/apache/spark/sql/expressions/scalalang/typed$.html) and -[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets. -Moreover, users are not limited to the predefined aggregate functions and can create their own. For more details +## Aggregate Functions + +Aggregate functions are functions that return a single value on a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. +Users are not limited to the predefined aggregate functions and can create their own. For more details about user defined aggregate functions, please refer to the documentation of [User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html). diff --git a/docs/sql-ref-functions.md b/docs/sql-ref-functions.md index 6368fb7..7493b8b 100644 --- a/docs/sql-ref-functions.md +++ b/docs/sql-ref-functions.md @@ -27,13 +27,16 @@ Built-in functions are commonly used routines that Spark SQL predefines and a co Spark SQL has some categories of frequently-used built-in functions for aggregtion, arrays/maps, date/timestamp, and JSON data. This subsection presents the usages and descriptions of these functions. 
- * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) - * [Window Functions](sql-ref-functions-builtin.html#window-functions) + Scalar Functions * [Array Functions](sql-ref-functions-builtin.html#array-functions) * [Map Functions](sql-ref-functions-builtin.html#map-functions) * [Date and Timestamp Functions](sql-ref-functions-builtin.html#date-and-timestamp-functions) * [JSON Functions](sql-ref-functions-builtin.html#json-functions) + Aggregate-like Functions + * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) + * [Window Functions](sql-ref-functions-builtin.html#window-functions) + ### UDFs (User-Defined Functions) User-Defined Functions (UDFs) are a feature of Spark SQL that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task. To use UDFs in Spark SQL, users must first define the function, then register the function with Spark, and finally call the registered function. The User-Defined Functions can act on a single row or act on multiple rows at once. Spark SQL also supports integration of exis
[spark] branch master updated: [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dcc0902 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started dcc0902 is described below commit dcc09022f1b8ecedf6b64bf35ce5d83500211351 Author: Huaxin Gao AuthorDate: Tue Apr 28 11:17:45 2020 -0500 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started ### What changes were proposed in this pull request? Add a paragraph for scalar function in sql getting started ### Why are the changes needed? To make 3.0 doc complete. ### Does this PR introduce any user-facing change? before: https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png;> after: https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png;> https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png;> ### How was this patch tested? Closes #28290 from huaxingao/scalar. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- docs/sql-getting-started.md | 13 + docs/sql-ref-functions.md | 7 +-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md index dab34af..5a6f182 100644 --- a/docs/sql-getting-started.md +++ b/docs/sql-getting-started.md @@ -347,16 +347,13 @@ For example: ## Scalar Functions -(to be filled soon) -## Aggregations +Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html). 
-The [built-in DataFrames functions](api/scala/org/apache/spark/sql/functions$.html) provide common -aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. -While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in -[Scala](api/scala/org/apache/spark/sql/expressions/scalalang/typed$.html) and -[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets. -Moreover, users are not limited to the predefined aggregate functions and can create their own. For more details +## Aggregate Functions + +Aggregate functions are functions that return a single value on a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. +Users are not limited to the predefined aggregate functions and can create their own. For more details about user defined aggregate functions, please refer to the documentation of [User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html). diff --git a/docs/sql-ref-functions.md b/docs/sql-ref-functions.md index 6368fb7..7493b8b 100644 --- a/docs/sql-ref-functions.md +++ b/docs/sql-ref-functions.md @@ -27,13 +27,16 @@ Built-in functions are commonly used routines that Spark SQL predefines and a co Spark SQL has some categories of frequently-used built-in functions for aggregtion, arrays/maps, date/timestamp, and JSON data. This subsection presents the usages and descriptions of these functions. 
- * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) - * [Window Functions](sql-ref-functions-builtin.html#window-functions) + Scalar Functions * [Array Functions](sql-ref-functions-builtin.html#array-functions) * [Map Functions](sql-ref-functions-builtin.html#map-functions) * [Date and Timestamp Functions](sql-ref-functions-builtin.html#date-and-timestamp-functions) * [JSON Functions](sql-ref-functions-builtin.html#json-functions) + Aggregate-like Functions + * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) + * [Window Functions](sql-ref-functions-builtin.html#window-functions) + ### UDFs (User-Defined Functions) User-Defined Functions (UDFs) are a feature of Spark SQL that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task. To use UDFs in Spark SQL, users must first define the function, then register the function with Spark, and finally call the registered function. The User-Defined Functions can act on a single row or act on multiple rows at once. Spark SQL also supports integration of existing Hive implementation
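The distinction the docs change draws — a scalar function returns one value per input row, an aggregate function returns one value per group of rows — can be illustrated outside Spark as well. A plain-Python analogy (not Spark API; the row data is invented for illustration):

```python
from collections import defaultdict

rows = [("a", 1), ("b", 4), ("a", 3), ("b", 2)]

# Scalar function: applied independently to each row -> one result per row.
upper_keys = [(k.upper(), v) for k, v in rows]  # 4 rows in, 4 rows out

# Aggregate function: applied to a group of rows -> one result per group.
groups = defaultdict(list)
for k, v in rows:
    groups[k].append(v)
averages = {k: sum(vs) / len(vs) for k, vs in groups.items()}  # 2 groups, 2 results

print(upper_keys)  # [('A', 1), ('B', 4), ('A', 3), ('B', 2)]
print(averages)    # {'a': 2.0, 'b': 3.0}
```

In Spark SQL terms, the first corresponds to something like `upper(col)` in a `SELECT` list, the second to `avg(col)` under a `GROUP BY`.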
[spark] branch branch-3.0 updated (3b30066 -> 6f10c8a)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 3b30066 [SPARK-31529][SQL][3.0] Remove extra whitespaces in formatted explain add 6f10c8a [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page No new revisions were added by this update. Summary of changes: docs/sql-ref.md | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7735db2a2 [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page 7735db2a2 is described below commit 7735db2a273edf208ae50e88926c9f7a77e5dbac Author: Huaxin Gao AuthorDate: Mon Apr 27 09:45:00 2020 -0500 [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page ### What changes were proposed in this pull request? Add links to subsections in SQL Reference main page ### Why are the changes needed? Make SQL Reference complete ### Does this PR introduce any user-facing change? Yes before: https://user-images.githubusercontent.com/13592258/80338238-a9551080-8810-11ea-8ae8-d6707fde2cac.png;> after: https://user-images.githubusercontent.com/13592258/80338241-ac500100-8810-11ea-8518-95c4f8c0a2eb.png;> ### How was this patch tested? Manually build and check. Closes #28360 from huaxingao/sql-ref. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- docs/sql-ref.md | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/docs/sql-ref.md b/docs/sql-ref.md index 6c57b0d6..db51fe1 100644 --- a/docs/sql-ref.md +++ b/docs/sql-ref.md @@ -1,7 +1,7 @@ --- layout: global -title: Reference -displayTitle: Reference +title: SQL Reference +displayTitle: SQL Reference license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -19,7 +19,21 @@ license: | limitations under the License. --- -Spark SQL is Apache Spark's module for working with structured data. -This guide is a reference for Structured Query Language (SQL) for Apache -Spark. This document describes the SQL constructs supported by Spark in detail -along with usage examples when applicable. 
+Spark SQL is Apache Spark's module for working with structured data. This guide is a reference for Structured Query Language (SQL) and includes syntax, semantics, keywords, and examples for common SQL usage. It contains information for the following topics: + + * [Data Types](sql-ref-datatypes.html) + * [Identifiers](sql-ref-identifier.html) + * [Literals](sql-ref-literals.html) + * [Null Semanitics](sql-ref-null-semantics.html) + * [ANSI Compliance](sql-ref-ansi-compliance.html) + * [SQL Syntax](sql-ref-syntax.html) + * [DDL Statements](sql-ref-syntax-ddl.html) + * [DML Statements](sql-ref-syntax-ddl.html) + * [Data Retrieval Statements](sql-ref-syntax-qry.html) + * [Auxiliary Statements](sql-ref-syntax-aux.html) + * [Functions](sql-ref-functions.html) + * [Built-in Functions](sql-ref-functions-builtin.html) + * [Scalar User-Defined Functions (UDFs)](sql-ref-functions-udf-scalar.html) + * [User-Defined Aggregate Functions (UDAFs)](sql-ref-functions-udf-aggregate.html) + * [Integration with Hive UDFs/UDAFs/UDTFs](sql-ref-functions-udf-hive.html) + * [Datetime Pattern](sql-ref-datetime-pattern.html) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31400][ML] The catalogString doesn't distinguish Vectors in ml and mllib
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fe07b21 [SPARK-31400][ML] The catalogString doesn't distinguish Vectors in ml and mllib fe07b21 is described below commit fe07b21b8ab60def6c4451c661e4dd46a4d48b5a Author: TJX2014 AuthorDate: Sun Apr 26 11:35:44 2020 -0500 [SPARK-31400][ML] The catalogString doesn't distinguish Vectors in ml and mllib What changes were proposed in this pull request? 1.Add class info output in org.apache.spark.ml.util.SchemaUtils#checkColumnType to distinct Vectors in ml and mllib 2.Add unit test Why are the changes needed? the catalogString doesn't distinguish Vectors in ml and mllib when mllib vector misused in ml https://issues.apache.org/jira/browse/SPARK-31400 Does this PR introduce any user-facing change? No How was this patch tested? Unit test is added Closes #28347 from TJX2014/master-catalogString-distinguish-Vectors-in-ml-and-mllib. 
Authored-by: TJX2014 Signed-off-by: Sean Owen --- .../org/apache/spark/ml/util/SchemaUtils.scala | 4 ++-- .../apache/spark/mllib/util/TestingUtilsSuite.scala | 21 - 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala b/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala index 752069d..c08d7e8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala @@ -42,8 +42,8 @@ private[spark] object SchemaUtils { val actualDataType = schema(colName).dataType val message = if (msg != null && msg.trim.length > 0) " " + msg else "" require(actualDataType.equals(dataType), - s"Column $colName must be of type ${dataType.catalogString} but was actually " + -s"${actualDataType.catalogString}.$message") + s"Column $colName must be of type ${dataType.getClass}:${dataType.catalogString} " + +s"but was actually ${actualDataType.getClass}:${actualDataType.catalogString}.$message") } /** diff --git a/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala index 3fcf1cf..bc80e86 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala @@ -20,9 +20,11 @@ package org.apache.spark.mllib.util import org.scalatest.exceptions.TestFailedException import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.VectorUDT +import org.apache.spark.ml.util.SchemaUtils import org.apache.spark.mllib.linalg.{Matrices, Vectors} import org.apache.spark.mllib.util.TestingUtils._ - +import org.apache.spark.sql.types.{StructField, StructType} class TestingUtilsSuite extends SparkFunSuite { test("Comparing doubles using relative error.") { @@ -457,4 +459,21 @@ class TestingUtilsSuite extends SparkFunSuite { assert(Matrices.sparse(2, 
2, Array(0, 1, 2), Array(0, 1), Array(3.1, 3.5)) !~= Matrices.dense(0, 0, Array()) relTol 0.01) } + + test("SPARK-31400, catalogString distinguish Vectors in ml and mllib") { +val schema = StructType(Array[StructField] { + StructField("features", new org.apache.spark.mllib.linalg.VectorUDT) +}) +val e = intercept[IllegalArgumentException] { + SchemaUtils.checkColumnType(schema, "features", new VectorUDT) +} +assert(e.getMessage.contains( + "org.apache.spark.mllib.linalg.VectorUDT:struct"), + "dataType is not desired") + +val normalSchema = StructType(Array[StructField] { + StructField("features", new VectorUDT) +}) +SchemaUtils.checkColumnType(normalSchema, "features", new VectorUDT) + } }
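The fix above changes the `require()` message in `SchemaUtils.checkColumnType` to pair each `catalogString` with its class. As a toy illustration only (plain Java with hypothetical stand-in types, not Spark's `SchemaUtils` or the real `VectorUDT` classes), this sketch shows why adding the class name disambiguates two types whose catalog strings are identical:

```java
// Toy sketch: MlVectorType and MllibVectorType are hypothetical stand-ins for
// org.apache.spark.ml.linalg.VectorUDT and org.apache.spark.mllib.linalg.VectorUDT,
// which render the same catalog string and so were indistinguishable in the old message.
public class CheckColumnTypeSketch {

    interface DataType { String catalogString(); }

    static class MlVectorType implements DataType {
        public String catalogString() { return "struct<type:tinyint,size:int,indices:array,values:array>"; }
    }
    static class MllibVectorType implements DataType {
        public String catalogString() { return "struct<type:tinyint,size:int,indices:array,values:array>"; }
    }

    // After the fix, the message includes the class alongside the catalog string.
    static String mismatchMessage(String colName, DataType expected, DataType actual) {
        return "Column " + colName + " must be of type "
            + expected.getClass().getSimpleName() + ":" + expected.catalogString()
            + " but was actually "
            + actual.getClass().getSimpleName() + ":" + actual.catalogString() + ".";
    }

    public static void main(String[] args) {
        // The class names now differ even though the catalog strings are equal.
        System.out.println(mismatchMessage("features", new MlVectorType(), new MllibVectorType()));
    }
}
```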
[spark] branch master updated (b10263b -> 0ede08b)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b10263b [SPARK-30724][SQL] Support 'LIKE ANY' and 'LIKE ALL' operators add 0ede08b [SPARK-31007][ML] KMeans optimization based on triangle-inequality No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ml/impl/Utils.scala | 53 - .../spark/ml/clustering/GaussianMixture.scala | 16 +- .../spark/mllib/clustering/DistanceMeasure.scala | 223 - .../org/apache/spark/mllib/clustering/KMeans.scala | 52 +++-- .../spark/mllib/clustering/KMeansModel.scala | 14 +- .../mllib/clustering/DistanceMeasureSuite.scala| 77 +++ 6 files changed, 390 insertions(+), 45 deletions(-) create mode 100644 mllib/src/test/scala/org/apache/spark/mllib/clustering/DistanceMeasureSuite.scala
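SPARK-31007's changes live in `DistanceMeasure.scala`; as a generic illustration only (not the committed code), the standard triangle-inequality bound such KMeans optimizations exploit is: if d(c_best, c_j) >= 2·d(x, c_best), then d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best), so center c_j cannot beat the current best and its distance to x never needs to be computed. A minimal sketch:

```java
// Sketch of triangle-inequality pruning in a nearest-center search. Pairwise
// center-to-center distances are precomputed once per iteration; each point then
// skips any center that the bound proves cannot be its nearest.
public class TriangleIneqSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    /** Returns the index of the nearest center; skipped[0] counts pruned evaluations. */
    static int nearestCenter(double[] x, double[][] centers, double[][] centerDist, int[] skipped) {
        int best = 0;
        double bestDist = dist(x, centers[0]);
        for (int j = 1; j < centers.length; j++) {
            // If centers best and j are far apart relative to bestDist, j cannot win.
            if (centerDist[best][j] >= 2 * bestDist) { skipped[0]++; continue; }
            double d = dist(x, centers[j]);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = { {0, 0}, {10, 0}, {0, 10} };
        double[][] centerDist = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) centerDist[i][j] = dist(centers[i], centers[j]);

        int[] skipped = {0};
        int nearest = nearestCenter(new double[] {1, 1}, centers, centerDist, skipped);
        System.out.println("nearest=" + nearest + " skipped=" + skipped[0]); // nearest=0 skipped=2
    }
}
```

Both far centers are pruned here without ever computing their distance to the point, which is where the speedup comes from when there are many centers.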
[spark] branch branch-3.0 updated: [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 6bc6b0d [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment 6bc6b0d is described below commit 6bc6b0d4400f2ba0338770662ebafad8a0de41ac Author: Cong Du AuthorDate: Wed Apr 22 09:44:43 2020 -0500 [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment ### What changes were proposed in this pull request? This PR fixes a typo in the deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala file. ### Why are the changes needed? To deliver a correct explanation of how the placement policy works. ### Does this PR introduce any user-facing change? No ### How was this patch tested? UT as specified, although it shouldn't influence any functionality since it's in a comment. Closes #28267 from asclepiusaka/master. Authored-by: Cong Du Signed-off-by: Sean Owen (cherry picked from commit 54b97b2e143774a7238fc5a5f63e0d6eec138c41) Signed-off-by: Sean Owen --- .../yarn/LocalityPreferredContainerPlacementStrategy.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala index 2288bb5..3e33382 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala @@ -40,7 +40,7 @@ private[yarn] case class ContainerLocalityPreferences(nodes: Array[String], rack * and cpus per task is 1, so the required container number is 15, * and host ratio is (host1: 30, host2: 30, host3: 20, 
host4: 10). * - * 1. If requested container number (18) is more than the required container number (15): + * 1. If the requested container number (18) is more than the required container number (15): * * requests for 5 containers with nodes: (host1, host2, host3, host4) * requests for 5 containers with nodes: (host1, host2, host3) @@ -63,16 +63,16 @@ private[yarn] case class ContainerLocalityPreferences(nodes: Array[String], rack * follow the method of 1 and 2. * * 4. If containers exist and some of them can match the requested localities. - * For example if we have 1 containers on each node (host1: 1, host2: 1: host3: 1, host4: 1), + * For example if we have 1 container on each node (host1: 1, host2: 1: host3: 1, host4: 1), * and the expected containers on each node would be (host1: 5, host2: 5, host3: 4, host4: 2), * so the newly requested containers on each node would be updated to (host1: 4, host2: 4, * host3: 3, host4: 1), 12 containers by total. * * 4.1 If requested container number (18) is more than newly required containers (12). Follow - * method 1 with updated ratio 4 : 4 : 3 : 1. + * method 1 with an updated ratio 4 : 4 : 3 : 1. * - * 4.2 If request container number (10) is more than newly required containers (12). Follow - * method 2 with updated ratio 4 : 4 : 3 : 1. + * 4.2 If request container number (10) is less than newly required containers (12). Follow + * method 2 with an updated ratio 4 : 4 : 3 : 1. * * 5. If containers exist and existing localities can fully cover the requested localities. * For example if we have 5 containers on each node (host1: 5, host2: 5, host3: 5, host4: 5),
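The class comment's batches ("5 containers with nodes (host1..host4), 5 with (host1..host3), ...") can be read as a layered decomposition of per-host expected counts. The sketch below is an illustration under stated assumptions, not Spark's actual implementation: it assumes hypothetical per-host counts (host1: 15, host2: 15, host3: 10, host4: 5), which are consistent with the 30:30:20:10 host ratio and 15 required containers, and peels off the smallest remaining count as one batch at a time.

```java
import java.util.*;

// Hedged sketch of the layered decomposition the comment describes: each batch of
// container requests shares one node-preference list; a host stays in the list
// until the batches emitted so far cover its expected count.
public class PlacementSketch {

    /** Each entry: batch size -> node list shared by every request in that batch. */
    static List<Map.Entry<Integer, List<String>>> toBatches(LinkedHashMap<String, Integer> expected) {
        List<Map.Entry<Integer, List<String>>> batches = new ArrayList<>();
        int emitted = 0; // containers requested by earlier batches (each mentions every live host)
        while (true) {
            List<String> nodes = new ArrayList<>();
            int layer = Integer.MAX_VALUE;
            for (Map.Entry<String, Integer> e : expected.entrySet()) {
                int remaining = e.getValue() - emitted; // still owed to this host
                if (remaining > 0) { nodes.add(e.getKey()); layer = Math.min(layer, remaining); }
            }
            if (nodes.isEmpty()) return batches;
            batches.add(new AbstractMap.SimpleEntry<>(layer, nodes));
            emitted += layer;
        }
    }

    public static void main(String[] args) {
        // Assumed counts for the 30:30:20:10 ratio with 15 required containers.
        LinkedHashMap<String, Integer> expected = new LinkedHashMap<>();
        expected.put("host1", 15); expected.put("host2", 15);
        expected.put("host3", 10); expected.put("host4", 5);
        for (Map.Entry<Integer, List<String>> b : toBatches(expected))
            System.out.println(b.getKey() + " containers with nodes: " + b.getValue());
    }
}
```

Under these assumed counts the output reproduces the comment's structure: 5 requests naming all four hosts, 5 naming (host1, host2, host3), and 5 naming (host1, host2).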
[spark] branch master updated (8b77b31 -> 54b97b2)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8b77b31 [SPARK-18886][CORE][FOLLOWUP] allow follow up locality resets even if no task was launched add 54b97b2 [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment No new revisions were added by this update. Summary of changes: .../yarn/LocalityPreferredContainerPlacementStrategy.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-)
[spark] branch branch-2.4 updated: Apply appropriate RPC handler to receive, receiveStream when auth enabled
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 9416b7c Apply appropriate RPC handler to receive, receiveStream when auth enabled 9416b7c is described below commit 9416b7c54bdf5613c1a65e6d1779a87591c6c9bd Author: Sean Owen AuthorDate: Fri Apr 17 13:25:12 2020 -0500 Apply appropriate RPC handler to receive, receiveStream when auth enabled --- .../spark/network/crypto/AuthRpcHandler.java | 73 +++--- .../apache/spark/network/sasl/SaslRpcHandler.java | 60 +++- .../network/server/AbstractAuthRpcHandler.java | 107 + .../spark/network/crypto/AuthIntegrationSuite.java | 12 +-- .../apache/spark/network/sasl/SparkSaslSuite.java | 3 +- 5 files changed, 142 insertions(+), 113 deletions(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java index 821cc7a..dd31c95 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java +++ b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java @@ -29,12 +29,11 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.spark.network.client.RpcResponseCallback; -import org.apache.spark.network.client.StreamCallbackWithID; import org.apache.spark.network.client.TransportClient; import org.apache.spark.network.sasl.SecretKeyHolder; import org.apache.spark.network.sasl.SaslRpcHandler; +import org.apache.spark.network.server.AbstractAuthRpcHandler; import org.apache.spark.network.server.RpcHandler; -import org.apache.spark.network.server.StreamManager; import org.apache.spark.network.util.TransportConf; /** @@ -46,7 +45,7 @@ import org.apache.spark.network.util.TransportConf; * The delegate will only receive messages if 
the given connection has been successfully * authenticated. A connection may be authenticated at most once. */ -class AuthRpcHandler extends RpcHandler { +class AuthRpcHandler extends AbstractAuthRpcHandler { private static final Logger LOG = LoggerFactory.getLogger(AuthRpcHandler.class); /** Transport configuration. */ @@ -55,36 +54,31 @@ class AuthRpcHandler extends RpcHandler { /** The client channel. */ private final Channel channel; - /** - * RpcHandler we will delegate to for authenticated connections. When falling back to SASL - * this will be replaced with the SASL RPC handler. - */ - @VisibleForTesting - RpcHandler delegate; - /** Class which provides secret keys which are shared by server and client on a per-app basis. */ private final SecretKeyHolder secretKeyHolder; - /** Whether auth is done and future calls should be delegated. */ + /** RPC handler for auth handshake when falling back to SASL auth. */ @VisibleForTesting - boolean doDelegate; + SaslRpcHandler saslHandler; AuthRpcHandler( TransportConf conf, Channel channel, RpcHandler delegate, SecretKeyHolder secretKeyHolder) { +super(delegate); this.conf = conf; this.channel = channel; -this.delegate = delegate; this.secretKeyHolder = secretKeyHolder; } @Override - public void receive(TransportClient client, ByteBuffer message, RpcResponseCallback callback) { -if (doDelegate) { - delegate.receive(client, message, callback); - return; + protected boolean doAuthChallenge( + TransportClient client, + ByteBuffer message, + RpcResponseCallback callback) { +if (saslHandler != null) { + return saslHandler.doAuthChallenge(client, message, callback); } int position = message.position(); @@ -98,18 +92,17 @@ class AuthRpcHandler extends RpcHandler { if (conf.saslFallback()) { LOG.warn("Failed to parse new auth challenge, reverting to SASL for client {}.", channel.remoteAddress()); -delegate = new SaslRpcHandler(conf, channel, delegate, secretKeyHolder); +saslHandler = new SaslRpcHandler(conf, channel, null, 
secretKeyHolder); message.position(position); message.limit(limit); -delegate.receive(client, message, callback); -doDelegate = true; +return saslHandler.doAuthChallenge(client, message, callback); } else { LOG.debug("Unexpected challenge message from client {}, closing channel.", channel.remoteAddress()); callback.onFailure(new IllegalArgumentException("Unknown challenge message.")); channel.close(); } - return; + return false; } // Here we have the client challenge, so perform the new auth protocol and set up the channel. @@ -131,7
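The diff above moves the "delegate only after authentication" state out of `AuthRpcHandler` into the new `AbstractAuthRpcHandler`, so the AES and SASL variants share the same gating for `receive` and `receiveStream`. A minimal sketch of that pattern, using assumed toy types (String messages instead of the real Netty/TransportClient types; not Spark's actual network-common API):

```java
// Sketch of the gating pattern: the abstract base owns the authenticated flag and
// forwards to the application delegate only once doAuthChallenge has succeeded.
public class AuthGateSketch {

    interface RpcHandler { String receive(String message); }

    static abstract class AbstractAuthRpcHandler implements RpcHandler {
        private final RpcHandler delegate;
        private boolean authenticated;

        AbstractAuthRpcHandler(RpcHandler delegate) { this.delegate = delegate; }

        /** One auth handshake step; returning true marks the channel authenticated. */
        protected abstract boolean doAuthChallenge(String message);

        @Override public String receive(String message) {
            if (authenticated) return delegate.receive(message); // delegate only after auth
            authenticated = doAuthChallenge(message);
            return authenticated ? "auth-ok" : "auth-pending";
        }
    }

    /** Toy challenge: the shared secret itself is the handshake message. */
    static class TokenAuthHandler extends AbstractAuthRpcHandler {
        private final String secret;
        TokenAuthHandler(RpcHandler delegate, String secret) { super(delegate); this.secret = secret; }
        @Override protected boolean doAuthChallenge(String message) { return secret.equals(message); }
    }

    public static void main(String[] args) {
        RpcHandler app = msg -> "handled:" + msg;
        TokenAuthHandler handler = new TokenAuthHandler(app, "s3cret");
        System.out.println(handler.receive("ping"));    // not authenticated: never reaches app
        System.out.println(handler.receive("s3cret"));  // handshake succeeds
        System.out.println(handler.receive("ping"));    // now delegated to app
    }
}
```

Centralizing the flag in the base class is what lets the real fix apply the same check to both `receive` overloads and `receiveStream` instead of each concrete handler re-implementing it.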