[spark] branch master updated (b1adc3d -> 88a4e55)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from b1adc3d  [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH_BUCKET
     add 88a4e55  [SPARK-31765][WEBUI][TEST-MAVEN] Upgrade HtmlUnit >= 2.37.0

No new revisions were added by this update.

Summary of changes:
 core/pom.xml                                              |  2 +-
 core/src/main/scala/org/apache/spark/ui/JettyUtils.scala  |  7 ++-
 .../test/scala/org/apache/spark/ui/UISeleniumSuite.scala  |  2 +-
 pom.xml                                                   | 14 +-
 sql/core/pom.xml                                          |  2 +-
 sql/hive-thriftserver/pom.xml                             |  2 +-
 streaming/pom.xml                                         |  2 +-
 7 files changed, 20 insertions(+), 11 deletions(-)

---
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (ca2cfd4 -> 6befb2d)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from ca2cfd4  [SPARK-31906][SQL][DOCS] Enhance comments in NamedExpression.qualifier
     add 6befb2d  [SPARK-31486][CORE] spark.submit.waitAppCompletion flag to control spark-submit exit in Standalone Cluster Mode

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/deploy/Client.scala | 95 --
 docs/spark-standalone.md                       | 19 +
 2 files changed, 88 insertions(+), 26 deletions(-)
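[Editor's note: the `spark.submit.waitAppCompletion` flag added by SPARK-31486 is passed like any other Spark configuration property. A sketch of a possible invocation; the master URL and application JAR path are placeholders, not from the commit:

```shell
# Hypothetical example: in standalone cluster mode, with
# spark.submit.waitAppCompletion=true, spark-submit blocks until the
# driver finishes rather than exiting as soon as the driver is submitted.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --conf spark.submit.waitAppCompletion=true \
  path/to/app.jar
```
]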
[spark] branch branch-3.0 updated: [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 3bb0824  [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
3bb0824 is described below

commit 3bb08245a9c5e9f9da6f48e65e7dc0f80ad96b4a
Author: Enrico Minack
AuthorDate: Wed Jun 3 18:06:13 2020 -0500

    [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide

    ### What changes were proposed in this pull request?
    The PySpark Migration Guide needs to mention a breaking change of the PySpark ML API.

    ### Why are the changes needed?
    In SPARK-29093, all setters were removed from the `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public PySpark ML API, hence this is a breaking change.

    ### Does this PR introduce _any_ user-facing change?
    Only documentation.

    ### How was this patch tested?
    Visually.

    Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

    Authored-by: Enrico Minack
    Signed-off-by: Sean Owen
    (cherry picked from commit 4bbe3c2bb49030256d7e4f6941dd5629ee6d5b66)
    Signed-off-by: Sean Owen
---
 docs/pyspark-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/pyspark-migration-guide.md b/docs/pyspark-migration-guide.md
index 6f0fbbf..2c9ea41 100644
--- a/docs/pyspark-migration-guide.md
+++ b/docs/pyspark-migration-guide.md
@@ -45,6 +45,8 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
 - As of Spark 3.0, `Row` field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered.
   To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to `true` for both executors and driver - this environment variable must be consistent on all executors and driver; otherwise, it may cause failures or incorrect answers. For [...]
+- In Spark 3.0, `pyspark.ml.param.shared.Has*` mixins do not provide any `set*(self, value)` setter methods anymore, use the respective `self.set(self.*, value)` instead. See [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093) for details.
+
 ## Upgrading from PySpark 2.3 to 2.4
 - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
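[Editor's note: the setter migration described in this commit can be illustrated with a minimal sketch. The `Params`/`HasThreshold` classes below are stand-ins that mimic the shape of `pyspark.ml.param`, not the real PySpark classes; only the `obj.set(obj.param, value)` call pattern reflects the documented change:

```python
# Minimal stand-ins mimicking the pyspark.ml.param API shape (not real
# pyspark code), to show the Spark 3.0 setter migration pattern.
class Params:
    """Tiny stand-in for pyspark.ml.param.Params."""
    def __init__(self):
        self._paramMap = {}

    def set(self, param, value):
        # The generic setter that remains available in Spark 3.0.
        self._paramMap[param] = value
        return self

    def getOrDefault(self, param):
        return self._paramMap[param]


class HasThreshold(Params):
    # Real pyspark uses a Param object here; a plain string suffices
    # for this sketch.
    threshold = "threshold"

    def getThreshold(self):
        return self.getOrDefault(self.threshold)


class MyModel(HasThreshold):
    pass


model = MyModel()
# Spark 2.4 style, removed in 3.0:  model.setThreshold(0.5)
# Spark 3.0 style:
model.set(model.threshold, 0.5)
print(model.getThreshold())  # -> 0.5
```
]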
[spark] branch master updated (dc0709f -> 4bbe3c2)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from dc0709f  [SPARK-29947][SQL][FOLLOWUP] ResolveRelations should return relations with fresh attribute IDs
     add 4bbe3c2  [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide

No new revisions were added by this update.

Summary of changes:
 docs/pyspark-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)
[spark] branch master updated: [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 4bbe3c2  [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
4bbe3c2 is described below

commit 4bbe3c2bb49030256d7e4f6941dd5629ee6d5b66
Author: Enrico Minack
AuthorDate: Wed Jun 3 18:06:13 2020 -0500

    [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide

    ### What changes were proposed in this pull request?
    The PySpark Migration Guide needs to mention a breaking change of the PySpark ML API.

    ### Why are the changes needed?
    In SPARK-29093, all setters were removed from the `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public PySpark ML API, hence this is a breaking change.

    ### Does this PR introduce _any_ user-facing change?
    Only documentation.

    ### How was this patch tested?
    Visually.

    Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

    Authored-by: Enrico Minack
    Signed-off-by: Sean Owen
---
 docs/pyspark-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/pyspark-migration-guide.md b/docs/pyspark-migration-guide.md
index 6f0fbbf..2c9ea41 100644
--- a/docs/pyspark-migration-guide.md
+++ b/docs/pyspark-migration-guide.md
@@ -45,6 +45,8 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
 - As of Spark 3.0, `Row` field names are no longer sorted alphabetically when constructing with named arguments for Python versions 3.6 and above, and the order of fields will match that as entered.
   To enable sorted fields by default, as in Spark 2.4, set the environment variable `PYSPARK_ROW_FIELD_SORTING_ENABLED` to `true` for both executors and driver - this environment variable must be consistent on all executors and driver; otherwise, it may cause failures or incorrect answers. For [...]
+- In Spark 3.0, `pyspark.ml.param.shared.Has*` mixins do not provide any `set*(self, value)` setter methods anymore, use the respective `self.set(self.*, value)` instead. See [SPARK-29093](https://issues.apache.org/jira/browse/SPARK-29093) for details.
+
 ## Upgrading from PySpark 2.3 to 2.4
 - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
[spark] branch master updated (d79a8a8 -> e5c3463)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from d79a8a8  [SPARK-31834][SQL] Improve error message for incompatible data types
     add e5c3463  [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0

No new revisions were added by this update.

Summary of changes:
 core/pom.xml                                                  |  2 +-
 core/src/main/scala/org/apache/spark/ui/JettyUtils.scala      |  7 ++-
 core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala |  2 +-
 pom.xml                                                       | 10 +-
 sql/core/pom.xml                                              |  2 +-
 sql/hive-thriftserver/pom.xml                                 |  2 +-
 streaming/pom.xml                                             |  2 +-
 7 files changed, 16 insertions(+), 11 deletions(-)
[spark] branch master updated (e70df2c -> 6a895d0)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from e70df2c  [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
     add 6a895d0  [SPARK-31804][WEBUI] Add real headless browser support for HistoryServer tests

No new revisions were added by this update.

Summary of changes:
 .../history/ChromeUIHistoryServerSuite.scala}      |   7 +-
 .../spark/deploy/history/HistoryServerSuite.scala  |  62 -
 .../history/RealBrowserUIHistoryServerSuite.scala  | 155 +
 3 files changed, 159 insertions(+), 65 deletions(-)
 copy core/src/test/scala/org/apache/spark/{ui/ChromeUISeleniumSuite.scala => deploy/history/ChromeUIHistoryServerSuite.scala} (88%)
 create mode 100644 core/src/test/scala/org/apache/spark/deploy/history/RealBrowserUIHistoryServerSuite.scala
[spark] branch master updated (bc24c99 -> e70df2c)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

    from bc24c99  [SPARK-31837][CORE] Shift to the new highest locality level if there is when recomputeLocality
     add e70df2c  [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)
[spark] branch master updated: [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new e70df2c  [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

e70df2c is described below

commit e70df2cea46f71461d8d401a420e946f999862c1
Author: Yuexin Zhang
AuthorDate: Mon Jun 1 09:46:18 2020 -0500

    [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue

    ### What changes were proposed in this pull request?

    Improve the check logic that decides whether all node managers are really blacklisted.

    ### Why are the changes needed?

    I observed that when the AM is out of sync with the ResourceManager, or the RM has trouble reporting back the current number of available NMs, something like the following happens:

    20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
    20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.

    The Spark job then suddenly runs into the AllNodeBlacklisted state:

    20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)

    However, there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and no blacklisting message appears from:
    https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

    We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND currentBlacklistedYarnNodes.size >= numClusterNodes.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    A minor change. No changes on tests.

    Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.

    Authored-by: Yuexin Zhang
    Signed-off-by: Sean Owen
---
 .../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
index fa8c961..339d371 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
     refreshBlacklistedNodes()
   }
 
-  def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+  def isAllNodeBlacklisted: Boolean = {
+    if (numClusterNodes <= 0) {
+      logWarning("No available nodes reported, please check Resource Manager.")
+      false
+    } else {
+      currentBlacklistedYarnNodes.size >= numClusterNodes
+    }
+  }
 
   private def refreshBlacklistedNodes(): Unit = {
     removeExpiredYarnBlacklistedNodes()

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
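[Editor's note: in isolation, the guarded check in this commit behaves as in the following Python sketch. The names `is_all_node_blacklisted`, `blacklisted_count`, and `num_cluster_nodes` are hypothetical stand-ins for the tracker's `isAllNodeBlacklisted`, `currentBlacklistedYarnNodes.size`, and `numClusterNodes`; this is not Spark code.]

```python
def is_all_node_blacklisted(blacklisted_count: int, num_cluster_nodes: int) -> bool:
    """Sketch of the fixed check: never report 'all nodes blacklisted'
    when the ResourceManager has reported zero (or unknown) cluster nodes."""
    if num_cluster_nodes <= 0:
        # The Scala code logs a warning here:
        # "No available nodes reported, please check Resource Manager."
        return False
    return blacklisted_count >= num_cluster_nodes

# Before the fix, the bare comparison `size >= numClusterNodes` was used,
# so when the RM reported 0 nodes, 0 >= 0 was true and the app failed with
# "Due to executor failures all available nodes are blacklisted".
assert is_all_node_blacklisted(0, 0) is False   # RM out of sync: not a failure
assert is_all_node_blacklisted(3, 3) is True    # genuinely all blacklisted
assert is_all_node_blacklisted(2, 3) is False
```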
[spark] branch master updated (47dc332 -> 45cf5e9)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 47dc332  [SPARK-31874][SQL] Use `FastDateFormat` as the legacy fractional formatter
  add 45cf5e9  [SPARK-31840][ML] Add instance weight support in LogisticRegressionSummary

No new revisions were added by this update.

Summary of changes:
 .../ml/classification/LogisticRegression.scala     | 99 +-
 .../classification/LogisticRegressionSuite.scala   | 61 +
 project/MimaExcludes.scala                         |  6 +-
 python/pyspark/ml/classification.py                | 11 +++
 4 files changed, 134 insertions(+), 43 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 5fa46eb  [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference

5fa46eb is described below

commit 5fa46eb3d50281943a446e6d10fc7c6621c011cd
Author: Huaxin Gao
AuthorDate: Sat May 30 14:51:45 2020 -0500

    [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference

    Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference, to make the SQL reference complete.

    Screenshots of the changed pages:
    https://user-images.githubusercontent.com/13592258/83316782-d6fcf300-a1dc-11ea-87f6-e357b9c739fd.png
    https://user-images.githubusercontent.com/13592258/83316784-d8c6b680-a1dc-11ea-95ea-10a1f75dcef9.png

    Only the above pages are changed. The following two pages are the same as before:
    https://user-images.githubusercontent.com/13592258/83223474-bfb3fc00-a12f-11ea-807a-824a618afa0b.png
    https://user-images.githubusercontent.com/13592258/83223478-c2165600-a12f-11ea-806e-a1e57dc35ef4.png

    Manually build and check

    Closes #28672 from huaxingao/coalesce_hint.

    Authored-by: Huaxin Gao
    Signed-off-by: Sean Owen
    (cherry picked from commit 1b780f364bfbb46944fe805a024bb6c32f5d2dde)
    Signed-off-by: Sean Owen
---
 docs/_data/menu-sql.yaml                           |  8 +--
 docs/sql-performance-tuning.md                     |  4 ++
 docs/sql-ref-syntax-qry-select-hints.md            | 83 --
 docs/sql-ref-syntax-qry-select-join.md             |  2 +-
 ...ng.md => sql-ref-syntax-qry-select-sampling.md} |  0
 ...ndow.md => sql-ref-syntax-qry-select-window.md} |  0
 docs/sql-ref-syntax-qry-select.md                  |  6 +-
 docs/sql-ref-syntax-qry.md                         |  6 +-
 docs/sql-ref-syntax.md                             |  6 +-
 9 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/docs/_data/menu-sql.yaml b/docs/_data/menu-sql.yaml
index 289a9d3..219e680 100644
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@@ -171,22 +171,22 @@
       url: sql-ref-syntax-qry-select-limit.html
     - text: Common Table Expression
       url: sql-ref-syntax-qry-select-cte.html
+    - text: Hints
+      url: sql-ref-syntax-qry-select-hints.html
     - text: Inline Table
       url: sql-ref-syntax-qry-select-inline-table.html
     - text: JOIN
       url: sql-ref-syntax-qry-select-join.html
-    - text: Join Hints
-      url: sql-ref-syntax-qry-select-hints.html
     - text: LIKE Predicate
       url: sql-ref-syntax-qry-select-like.html
     - text: Set Operators
       url: sql-ref-syntax-qry-select-setops.html
     - text: TABLESAMPLE
-      url: sql-ref-syntax-qry-sampling.html
+      url: sql-ref-syntax-qry-select-sampling.html
     - text: Table-valued Function
       url: sql-ref-syntax-qry-select-tvf.html
     - text: Window Function
-      url: sql-ref-syntax-qry-window.html
+      url: sql-ref-syntax-qry-select-window.html
     - text: EXPLAIN
       url: sql-ref-syntax-qry-explain.html
     - text: Auxiliary Statements

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 5b784a5..5e6f049 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -179,6 +179,8 @@
 SELECT /*+ BROADCAST(r) */ * FROM records r JOIN src s ON r.key = s.key
 
+For more details please refer to the documentation of [Join Hints](sql-ref-syntax-qry-select-hints.html#join-hints).
+
 ## Coalesce Hints for SQL Queries
 
 Coalesce hints allows the Spark SQL users to control the number of output files just like the
@@ -194,6 +196,8 @@
 SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
 SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
 
+For more details please refer to the documentation of [Partitioning Hints](sql-ref-syntax-qry-select-hints.html#partitioning-hints).
+
 ## Adaptive Query Execution
 
 Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join [...]
[spark] branch master updated (b9737c3 -> 1b780f3)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from b9737c3  [SPARK-31864][SQL] Adjust AQE skew join trigger condition
  add 1b780f3  [SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference

No new revisions were added by this update.

Summary of changes:
 docs/_data/menu-sql.yaml                           |  8 +--
 docs/sql-performance-tuning.md                     |  4 +-
 docs/sql-ref-syntax-qry-select-hints.md            | 83 --
 docs/sql-ref-syntax-qry-select-join.md             |  2 +-
 ...ng.md => sql-ref-syntax-qry-select-sampling.md} |  0
 ...ndow.md => sql-ref-syntax-qry-select-window.md} |  0
 docs/sql-ref-syntax-qry-select.md                  |  6 +-
 docs/sql-ref-syntax-qry.md                         |  6 +-
 docs/sql-ref-syntax.md                             |  6 +-
 9 files changed, 94 insertions(+), 21 deletions(-)
 rename docs/{sql-ref-syntax-qry-sampling.md => sql-ref-syntax-qry-select-sampling.md} (100%)
 rename docs/{sql-ref-syntax-qry-window.md => sql-ref-syntax-qry-select-window.md} (100%)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (765105b -> 50492c0)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 765105b  [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages
  add 50492c0  [SPARK-31803][ML] Make sure instance weight is not negative

No new revisions were added by this update.

Summary of changes:
 mllib/src/main/scala/org/apache/spark/ml/Predictor.scala           | 3 ++-
 .../main/scala/org/apache/spark/ml/classification/NaiveBayes.scala | 5 +++--
 .../scala/org/apache/spark/ml/clustering/BisectingKMeans.scala     | 3 ++-
 .../scala/org/apache/spark/ml/clustering/GaussianMixture.scala     | 3 ++-
 mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala   | 3 ++-
 .../apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala | 3 ++-
 .../scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala | 3 ++-
 .../scala/org/apache/spark/ml/evaluation/ClusteringMetrics.scala   | 2 --
 .../spark/ml/evaluation/MulticlassClassificationEvaluator.scala    | 3 ++-
 .../scala/org/apache/spark/ml/evaluation/RegressionEvaluator.scala | 4 +++-
 mllib/src/main/scala/org/apache/spark/ml/functions.scala           | 6 ++
 .../apache/spark/ml/regression/GeneralizedLinearRegression.scala   | 3 ++-
 .../scala/org/apache/spark/ml/regression/IsotonicRegression.scala  | 7 ---
 13 files changed, 32 insertions(+), 16 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (8f2b6f3 -> 765105b)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 8f2b6f3  [SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression
  add 765105b  [SPARK-31638][WEBUI] Clean Pagination code for all webUI pages

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/ui/jobs/AllJobsPage.scala     |  23 +--
 .../scala/org/apache/spark/ui/jobs/StagePage.scala |  14 +-
 .../org/apache/spark/ui/jobs/StageTable.scala      |  21 +--
 .../scala/org/apache/spark/ui/StagePageSuite.scala |   1 -
 .../spark/sql/execution/ui/AllExecutionsPage.scala |  29 +---
 .../hive/thriftserver/ui/ThriftServerPage.scala    | 164 +
 6 files changed, 93 insertions(+), 159 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (7f36310 -> d400777)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 7f36310  [SPARK-31802][SQL] Format Java date-time types in `Row.jsonValue` directly
  add d400777  [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator

No new revisions were added by this update.

Summary of changes:
 .../spark/ml/evaluation/ClusteringEvaluator.scala  |  34 --
 .../spark/ml/evaluation/ClusteringMetrics.scala    | 128 -
 .../ml/evaluation/ClusteringEvaluatorSuite.scala   |  43 ++-
 python/pyspark/ml/evaluation.py                    |  29 -
 4 files changed, 167 insertions(+), 67 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new d400777  [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator

d400777 is described below

commit d4007776f2dd85f03f3811ab8ca711f221f62c00
Author: Huaxin Gao
AuthorDate: Mon May 25 09:18:08 2020 -0500

    [SPARK-31734][ML][PYSPARK] Add weight support in ClusteringEvaluator

    ### What changes were proposed in this pull request?

    Add weight support in ClusteringEvaluator.

    ### Why are the changes needed?

    Currently, BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator support instance weight, but ClusteringEvaluator doesn't, so we will add instance weight support in ClusteringEvaluator.

    ### Does this PR introduce _any_ user-facing change?

    Yes. ClusteringEvaluator.setWeightCol

    ### How was this patch tested?

    add new unit test

    Closes #28553 from huaxingao/weight_evaluator.

    Authored-by: Huaxin Gao
    Signed-off-by: Sean Owen
---
 .../spark/ml/evaluation/ClusteringEvaluator.scala  |  34 --
 .../spark/ml/evaluation/ClusteringMetrics.scala    | 128 -
 .../ml/evaluation/ClusteringEvaluatorSuite.scala   |  43 ++-
 python/pyspark/ml/evaluation.py                    |  29 -
 4 files changed, 167 insertions(+), 67 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
index 63b99a0..19790fd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
@@ -19,10 +19,11 @@ package org.apache.spark.ml.evaluation
 
 import org.apache.spark.annotation.Since
 import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
-import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol, HasWeightCol}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.Dataset
-import org.apache.spark.sql.functions.col
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types.DoubleType
 
 /**
  * Evaluator for clustering results.
@@ -34,7 +35,8 @@ import org.apache.spark.sql.functions.col
  */
 @Since("2.3.0")
 class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String)
-  extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable {
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with HasWeightCol
+    with DefaultParamsWritable {
 
   @Since("2.3.0")
   def this() = this(Identifiable.randomUID("cluEval"))
@@ -53,6 +55,10 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
   @Since("2.3.0")
   def setFeaturesCol(value: String): this.type = set(featuresCol, value)
 
+  /** @group setParam */
+  @Since("3.1.0")
+  def setWeightCol(value: String): this.type = set(weightCol, value)
+
   /**
    * param for metric name in evaluation
    * (supports `"silhouette"` (default))
@@ -116,12 +122,26 @@ class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: Str
    */
   @Since("3.1.0")
   def getMetrics(dataset: Dataset[_]): ClusteringMetrics = {
-    SchemaUtils.validateVectorCompatibleColumn(dataset.schema, $(featuresCol))
-    SchemaUtils.checkNumericType(dataset.schema, $(predictionCol))
+    val schema = dataset.schema
+    SchemaUtils.validateVectorCompatibleColumn(schema, $(featuresCol))
+    SchemaUtils.checkNumericType(schema, $(predictionCol))
+    if (isDefined(weightCol)) {
+      SchemaUtils.checkNumericType(schema, $(weightCol))
+    }
+
+    val weightColName = if (!isDefined(weightCol)) "weightCol" else $(weightCol)
 
     val vectorCol = DatasetUtils.columnToVector(dataset, $(featuresCol))
-    val df = dataset.select(col($(predictionCol)),
-      vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata))
+    val df = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
+      dataset.select(col($(predictionCol)),
+        vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata),
+        lit(1.0).as(weightColName))
+    } else {
+      dataset.select(col($(predictionCol)),
+        vectorCol.as($(featuresCol), dataset.schema($(featuresCol)).metadata),
+        col(weightColName).cast(DoubleType))
+    }
+
     val metrics = new ClusteringMetrics(df)
     metrics.setDistanceMeasure($(distanceMeasure))
     metrics

diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/Clust
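[Editor's note: the weight-handling pattern in `getMetrics` above — default every row to weight 1.0 when no weight column is set, otherwise use the user's column cast to double — can be sketched standalone as follows. This is a hedged Python illustration with hypothetical names (`resolve_weights`, plain dicts for rows), not the actual Spark DataFrame API.]

```python
def resolve_weights(rows, weight_col=None):
    """Mimic the getMetrics logic: rows are dicts. If weight_col is unset
    or empty, every row gets weight 1.0; otherwise the named column is
    used, cast to float (the Scala code casts to DoubleType)."""
    if not weight_col:
        # Corresponds to: lit(1.0).as(weightColName)
        return [1.0 for _ in rows]
    # Corresponds to: col(weightColName).cast(DoubleType)
    return [float(r[weight_col]) for r in rows]

rows = [{"prediction": 0, "w": "2"}, {"prediction": 1, "w": "0.5"}]
assert resolve_weights(rows) == [1.0, 1.0]       # no weight column set
assert resolve_weights(rows, "w") == [2.0, 0.5]  # user weights, cast to double
```

Per the PR description, the actual user-facing entry point is `ClusteringEvaluator.setWeightCol`, with a matching `weightCol` parameter added on the Python side in `python/pyspark/ml/evaluation.py`.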
[spark] branch master updated (cf7463f -> d0fe433)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from cf7463f  [SPARK-31761][SQL] cast integer to Long to avoid IntegerOverflow for IntegralDivide operator
  add d0fe433  [SPARK-31768][ML] add getMetrics in Evaluators

No new revisions were added by this update.

Summary of changes:
 .../evaluation/BinaryClassificationEvaluator.scala | 26 +-
 .../spark/ml/evaluation/ClusteringEvaluator.scala  | 559 +
 ...ringEvaluator.scala => ClusteringMetrics.scala} | 173 ++-
 .../MulticlassClassificationEvaluator.scala        | 54 +-
 .../MultilabelClassificationEvaluator.scala        | 36 +-
 .../spark/ml/evaluation/RankingEvaluator.scala     | 28 +-
 .../spark/ml/evaluation/RegressionEvaluator.scala  | 28 +-
 .../spark/mllib/evaluation/MulticlassMetrics.scala |  3 +-
 .../BinaryClassificationEvaluatorSuite.scala       | 23 +
 .../ml/evaluation/ClusteringEvaluatorSuite.scala   | 16 +
 .../MulticlassClassificationEvaluatorSuite.scala   | 29 ++
 .../MultilabelClassificationEvaluatorSuite.scala   | 48 ++
 .../ml/evaluation/RankingEvaluatorSuite.scala      | 38 ++
 .../ml/evaluation/RegressionEvaluatorSuite.scala   | 33 ++
 14 files changed, 375 insertions(+), 719 deletions(-)
 copy mllib/src/main/scala/org/apache/spark/ml/evaluation/{ClusteringEvaluator.scala => ClusteringMetrics.scala} (80%)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (892b600 -> d955708)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

 from 892b600  [SPARK-31790][DOCS] cast(long as timestamp) show different result between Hive and Spark
  add d955708  [SPARK-31756][WEBUI] Add real headless browser support for UI test

No new revisions were added by this update.

Summary of changes:
 .../tags/{DockerTest.java => ChromeUITest.java}    |   3 +-
 .../apache/spark/ui/ChromeUISeleniumSuite.scala    |  29 +++---
 .../spark/ui/RealBrowserUISeleniumSuite.scala      | 109 +
 .../org/apache/spark/ui/UISeleniumSuite.scala      |  27 -
 dev/run-tests.py                                   |   5 +
 pom.xml                                            |   2 +
 6 files changed, 132 insertions(+), 43 deletions(-)
 copy common/tags/src/test/java/org/apache/spark/tags/{DockerTest.java => ChromeUITest.java} (96%)
 copy mllib/src/test/scala/org/apache/spark/ml/util/TempDirectory.scala => core/src/test/scala/org/apache/spark/ui/ChromeUISeleniumSuite.scala (62%)
 create mode 100644 core/src/test/scala/org/apache/spark/ui/RealBrowserUISeleniumSuite.scala

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (dae7988 -> f1495c5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from dae7988 [SPARK-31354] SparkContext only register one SparkSession ApplicationEnd listener add f1495c5 [SPARK-31688][WEBUI] Refactor Pagination framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ui/PagedTable.scala | 101 - .../org/apache/spark/ui/jobs/AllJobsPage.scala | 128 ++- .../org/apache/spark/ui/jobs/StageTable.scala | 114 ++ .../org/apache/spark/ui/storage/RDDPage.scala | 64 ++ .../spark/sql/execution/ui/AllExecutionsPage.scala | 123 ++ .../hive/thriftserver/ui/ThriftServerPage.scala| 251 + .../thriftserver/ui/ThriftServerSessionPage.scala | 29 +-- 7 files changed, 238 insertions(+), 572 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (d2bec5e -> 097d509)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d2bec5e [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax add 097d509 [MINOR] Fix a typo in FsHistoryProvider loginfo No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [MINOR] Fix a typo in FsHistoryProvider loginfo
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 097d509 [MINOR] Fix a typo in FsHistoryProvider loginfo 097d509 is described below commit 097d5098cca987e5f7bbb8394783c01517ebed0f Author: Sungpeo Kook AuthorDate: Sun May 17 09:43:01 2020 -0500 [MINOR] Fix a typo in FsHistoryProvider loginfo ## What changes were proposed in this pull request? a typo in logging. (just added `: `) Closes #28505 from sungpeo/typo_fshistoryprovider. Authored-by: Sungpeo Kook Signed-off-by: Sean Owen --- .../main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala index 99d3ece..25ea75a 100644 --- a/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala @@ -108,7 +108,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock) private val historyUiAdminAclsGroups = conf.get(History.HISTORY_SERVER_UI_ADMIN_ACLS_GROUPS) logInfo(s"History server ui acls " + (if (historyUiAclsEnable) "enabled" else "disabled") + "; users with admin permissions: " + historyUiAdminAcls.mkString(",") + -"; groups with admin permissions" + historyUiAdminAclsGroups.mkString(",")) +"; groups with admin permissions: " + historyUiAdminAclsGroups.mkString(",")) private val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf) // Visible for testing - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (5d90886 -> 194ac3b)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 5d90886 [SPARK-31716][SQL] Use fallback versions in HiveExternalCatalogVersionsSuite add 194ac3b [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector No new revisions were added by this update. Summary of changes: docs/ml-features.md| 140 + docs/ml-statistics.md | 56 - ...rExample.java => JavaANOVASelectorExample.java} | 35 +++--- .../spark/examples/ml/JavaANOVATestExample.java| 2 +- ...Example.java => JavaFValueSelectorExample.java} | 34 ++--- .../spark/examples/ml/JavaFValueTestExample.java | 2 +- ...lector_example.py => anova_selector_example.py} | 24 ++-- examples/src/main/python/ml/anova_test_example.py | 2 +- ...ector_example.py => fvalue_selector_example.py} | 26 ++-- examples/src/main/python/ml/fvalue_test_example.py | 2 +- ...torExample.scala => ANOVASelectorExample.scala} | 30 +++-- .../spark/examples/ml/ANOVATestExample.scala | 2 +- ...orExample.scala => FValueSelectorExample.scala} | 30 +++-- ...ueTestExample.scala => FValueTestExample.scala} | 0 14 files changed, 308 insertions(+), 77 deletions(-) copy examples/src/main/java/org/apache/spark/examples/ml/{JavaChiSqSelectorExample.java => JavaANOVASelectorExample.java} (66%) copy examples/src/main/java/org/apache/spark/examples/ml/{JavaVarianceThresholdSelectorExample.java => JavaFValueSelectorExample.java} (76%) copy examples/src/main/python/ml/{chisq_selector_example.py => anova_selector_example.py} (62%) copy examples/src/main/python/ml/{variance_threshold_selector_example.py => fvalue_selector_example.py} (58%) copy examples/src/main/scala/org/apache/spark/examples/ml/{ChiSqSelectorExample.scala => ANOVASelectorExample.scala} (64%) copy examples/src/main/scala/org/apache/spark/examples/ml/{ChiSqSelectorExample.scala => FValueSelectorExample.scala} (62%) rename 
examples/src/main/scala/org/apache/spark/examples/ml/{FVlaueTestExample.scala => FValueTestExample.scala} (100%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 194ac3b [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector 194ac3b is described below commit 194ac3be8bd8ca1b5e463074ed61420f185e8caf Author: Huaxin Gao AuthorDate: Fri May 15 09:59:14 2020 -0500 [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector ### What changes were proposed in this pull request? Add docs and examples for ANOVASelector and FValueSelector ### Why are the changes needed? Complete the implementation of ANOVASelector and FValueSelector ### Does this PR introduce _any_ user-facing change? Yes (screenshots of the updated docs pages omitted)
### How was this patch tested? Manually build and check Closes #28524 from huaxingao/examples. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- docs/ml-features.md| 140 + docs/ml-statistics.md | 56 - ...tExample.java => JavaANOVASelectorExample.java} | 48 +++ .../spark/examples/ml/JavaANOVATestExample.java| 2 +- ...Example.java => JavaFValueSelectorExample.java} | 48 +++ .../spark/examples/ml/JavaFValueTestExample.java | 2 +- ...a_test_example.py => anova_selector_example.py} | 35 +++--- examples/src/main/python/ml/anova_test_example.py | 2 +- ..._test_example.py => fvalue_selector_example.py} | 35 +++--- examples/src/main/python/ml/fvalue_test_example.py | 2 +- ...estExample.scala => ANOVASelectorExample.scala} | 42 --- .../spark/examples/ml/ANOVATestExample.scala | 2 +- ...stExample.scala => FValueSelectorExample.scala} | 42 --- ...ueTestExample.scala => FValueTestExample.scala} | 0 14 files changed, 340 insertions(+), 116 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 65b60be..660c272 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1793,6 +1793,146 @@ for more details on the API. +## ANOVASelector + +`ANOVASelector` operates on categorical labels with continuous features. It uses the +[one-way ANOVA F-test](https://en.wikipedia.org/wiki/F-test#Multiple-comparison_ANOVA_problems) to decide which +features to choose.
+It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: +* `numTopFeatures` chooses a fixed number of top features according to ANOVA F-test. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values a
[spark] branch branch-3.0 updated: [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 6834f46 [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary 6834f46 is described below commit 6834f4691b3e2603d410bfe24f0db0b6e3a36446 Author: Huaxin Gao AuthorDate: Thu May 14 10:54:35 2020 -0500 [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary ### What changes were proposed in this pull request? Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark ### Why are the changes needed? Currently we have ``` since("2.0.0") def evaluate(self, dataset): if not isinstance(dataset, DataFrame): raise ValueError("dataset must be a DataFrame but got %s." % type(dataset)) java_blr_summary = self._call_java("evaluate", dataset) return BinaryLogisticRegressionSummary(java_blr_summary) ``` we should return LogisticRegressionSummary for multiclass logistic regression ### Does this PR introduce _any_ user-facing change? Yes return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python ### How was this patch tested? unit test Closes #28503 from huaxingao/lr_summary. 
Authored-by: Huaxin Gao Signed-off-by: Sean Owen (cherry picked from commit e10516ae63cfc58f2d493e4d3f19940d45c8f033) Signed-off-by: Sean Owen --- python/pyspark/ml/classification.py | 5 - python/pyspark/ml/tests/test_training_summary.py | 6 +- 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/python/pyspark/ml/classification.py b/python/pyspark/ml/classification.py index 1436b78..424e16a 100644 --- a/python/pyspark/ml/classification.py +++ b/python/pyspark/ml/classification.py @@ -831,7 +831,10 @@ class LogisticRegressionModel(JavaProbabilisticClassificationModel, _LogisticReg if not isinstance(dataset, DataFrame): raise ValueError("dataset must be a DataFrame but got %s." % type(dataset)) java_blr_summary = self._call_java("evaluate", dataset) -return BinaryLogisticRegressionSummary(java_blr_summary) +if self.numClasses <= 2: +return BinaryLogisticRegressionSummary(java_blr_summary) +else: +return LogisticRegressionSummary(java_blr_summary) class LogisticRegressionSummary(JavaWrapper): diff --git a/python/pyspark/ml/tests/test_training_summary.py b/python/pyspark/ml/tests/test_training_summary.py index 1d19ebf..b505409 100644 --- a/python/pyspark/ml/tests/test_training_summary.py +++ b/python/pyspark/ml/tests/test_training_summary.py @@ -21,7 +21,8 @@ import unittest if sys.version > '3': basestring = str -from pyspark.ml.classification import LogisticRegression +from pyspark.ml.classification import BinaryLogisticRegressionSummary, LogisticRegression, \ +LogisticRegressionSummary from pyspark.ml.clustering import BisectingKMeans, GaussianMixture, KMeans from pyspark.ml.linalg import Vectors from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression @@ -149,6 +150,7 @@ class TrainingSummaryTest(SparkSessionTestCase): # test evaluation (with training dataset) produces a summary with same values # one check is enough to verify a summary is returned, Scala version runs full test sameSummary = model.evaluate(df) 
+self.assertTrue(isinstance(sameSummary, BinaryLogisticRegressionSummary)) self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC) def test_multiclass_logistic_regression_summary(self): @@ -187,6 +189,8 @@ class TrainingSummaryTest(SparkSessionTestCase): # test evaluation (with training dataset) produces a summary with same values # one check is enough to verify a summary is returned, Scala version runs full test sameSummary = model.evaluate(df) +self.assertTrue(isinstance(sameSummary, LogisticRegressionSummary)) +self.assertFalse(isinstance(sameSummary, BinaryLogisticRegressionSummary)) self.assertAlmostEqual(sameSummary.accuracy, s.accuracy) def test_gaussian_mixture_summary(self): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
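The patch above makes `evaluate` branch on `numClasses`, so binary models keep getting the richer `BinaryLogisticRegressionSummary` while multiclass models now get the base `LogisticRegressionSummary`. A minimal sketch of that dispatch, using stub classes rather than the real `pyspark.ml` ones (no Spark session required):

```python
# Stub stand-ins for the pyspark.ml classes; only the dispatch logic
# from the SPARK-31681 patch is reproduced here.
class LogisticRegressionSummary:
    def __init__(self, java_summary):
        self.java_summary = java_summary

class BinaryLogisticRegressionSummary(LogisticRegressionSummary):
    pass

class LogisticRegressionModel:
    def __init__(self, num_classes):
        self.numClasses = num_classes

    def evaluate(self, dataset):
        # Stands in for self._call_java("evaluate", dataset).
        java_blr_summary = dataset
        # The fix: only wrap in the binary summary for <= 2 classes.
        if self.numClasses <= 2:
            return BinaryLogisticRegressionSummary(java_blr_summary)
        return LogisticRegressionSummary(java_blr_summary)

binary = LogisticRegressionModel(num_classes=2).evaluate(object())
multi = LogisticRegressionModel(num_classes=3).evaluate(object())
print(isinstance(binary, BinaryLogisticRegressionSummary))  # True
print(isinstance(multi, BinaryLogisticRegressionSummary))   # False
print(isinstance(multi, LogisticRegressionSummary))         # True
```

Because `BinaryLogisticRegressionSummary` subclasses `LogisticRegressionSummary`, existing callers that only use base-summary fields keep working for both branches; this mirrors the isinstance checks added to the unit test.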
[spark] branch master updated (b2300fc -> e10516a)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b2300fc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) add e10516a [SPARK-31681][ML][PYSPARK] Python multiclass logistic regression evaluate should return LogisticRegressionSummary No new revisions were added by this update. Summary of changes: python/pyspark/ml/classification.py | 5 - python/pyspark/ml/tests/test_training_summary.py | 6 +- 2 files changed, 9 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-2.4 updated: [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1ea5844 [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) 1ea5844 is described below commit 1ea584443e9372a6a0b3c8449f5bf7e9e1369b0d Author: Weichen Xu AuthorDate: Thu May 14 09:24:40 2020 -0500 [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0 ``` for (i <- 0 until splits.length) { if (splits(i) == -0.0) { splits(i) = 0.0 } } ``` Fix bug. No Unit test. ~~~scala import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) import spark.implicits._ val df1 = sc.parallelize(a1, 2).toDF("id") import org.apache.spark.ml.feature.QuantileDiscretizer val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0) val model = qd.fit(df1) // will raise error in spark master. ~~~ In Scala, `0.0 == -0.0` is true but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code. array.distinct relies on elem.hashCode, which leads to this error. Test code on distinct ``` import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) a1.distinct.sorted.foreach(x => print(x.toString + "\n")) ``` Then you will see output like: ``` ... -0.009292684662246975 -0.0033280686465135823 -0.0 0.0 0.0022219556032221366 0.02217419561977274 ... 
``` Closes #28498 from WeichenXu123/SPARK-31676. Authored-by: Weichen Xu Signed-off-by: Sean Owen (cherry picked from commit b2300fca1e1a22d74c6eeda37942920a6c6299ff) Signed-off-by: Sean Owen --- .../apache/spark/ml/feature/QuantileDiscretizer.scala | 12 .../spark/ml/feature/QuantileDiscretizerSuite.scala| 18 ++ 2 files changed, 30 insertions(+) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 56e2c54..f3ec358 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -243,6 +243,18 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui private def getDistinctSplits(splits: Array[Double]): Array[Double] = { splits(0) = Double.NegativeInfinity splits(splits.length - 1) = Double.PositiveInfinity + +// 0.0 and -0.0 are distinct values, array.distinct will preserve both of them. +// but 0.0 > -0.0 is False which will break the parameter validation checking. +// and in scala <= 2.12, there's bug which will cause array.distinct generate +// non-deterministic results when array contains both 0.0 and -0.0 +// So that here we should first normalize all 0.0 and -0.0 to be 0.0 +// See https://github.com/scala/bug/issues/11995 +for (i <- 0 until splits.length) { + if (splits(i) == -0.0) { +splits(i) = 0.0 + } +} val distinctSplits = splits.distinct if (splits.length != distinctSplits.length) { log.warn(s"Some quantiles were identical. 
Bucketing to ${distinctSplits.length - 1}" + diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala index b009038..9c37416 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala @@ -443,4 +443,22 @@ class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest { discretizer.fit(df) } } + + test("[SPARK-31676] QuantileDiscretizer raise error parameter splits given invalid value") { +import scala.util.Random +val rng = new Random(3) + +val a1 = Array.tabulate(200)(_ => rng.nextDouble * 2.0 - 1.0) ++ + Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) + +val
[spark] branch branch-3.0 updated: [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 00e6acc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) 00e6acc is described below commit 00e6acc9c6d45c5dd3b3f70c87909743a8073dba Author: Weichen Xu AuthorDate: Thu May 14 09:24:40 2020 -0500 [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) ### What changes were proposed in this pull request? In QuantileDiscretizer.getDistinctSplits, before invoking distinct, normalize all -0.0 and 0.0 to be 0.0 ``` for (i <- 0 until splits.length) { if (splits(i) == -0.0) { splits(i) = 0.0 } } ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Manual test: ~~~scala import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) import spark.implicits._ val df1 = sc.parallelize(a1, 2).toDF("id") import org.apache.spark.ml.feature.QuantileDiscretizer val qd = new QuantileDiscretizer().setInputCol("id").setOutputCol("out").setNumBuckets(200).setRelativeError(0.0) val model = qd.fit(df1) // will raise error in spark master. ~~~ ### Explanation In Scala, `0.0 == -0.0` is true but `0.0.hashCode == -0.0.hashCode()` is false. This breaks the contract between equals() and hashCode(): if two objects are equal, they must have the same hash code. array.distinct relies on elem.hashCode, which leads to this error. 
Test code on distinct ``` import scala.util.Random val rng = new Random(3) val a1 = Array.tabulate(200)(_=>rng.nextDouble * 2.0 - 1.0) ++ Array.fill(20)(0.0) ++ Array.fill(20)(-0.0) a1.distinct.sorted.foreach(x => print(x.toString + "\n")) ``` Then you will see output like: ``` ... -0.009292684662246975 -0.0033280686465135823 -0.0 0.0 0.0022219556032221366 0.02217419561977274 ... ``` Closes #28498 from WeichenXu123/SPARK-31676. Authored-by: Weichen Xu Signed-off-by: Sean Owen (cherry picked from commit b2300fca1e1a22d74c6eeda37942920a6c6299ff) Signed-off-by: Sean Owen --- .../apache/spark/ml/feature/QuantileDiscretizer.scala | 12 .../spark/ml/feature/QuantileDiscretizerSuite.scala| 18 ++ 2 files changed, 30 insertions(+) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 216d99d..4eedfc4 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -236,6 +236,18 @@ final class QuantileDiscretizer @Since("1.6.0") (@Since("1.6.0") override val ui private def getDistinctSplits(splits: Array[Double]): Array[Double] = { splits(0) = Double.NegativeInfinity splits(splits.length - 1) = Double.PositiveInfinity + +// 0.0 and -0.0 are distinct values, array.distinct will preserve both of them. +// but 0.0 > -0.0 is False which will break the parameter validation checking. +// and in scala <= 2.12, there's bug which will cause array.distinct generate +// non-deterministic results when array contains both 0.0 and -0.0 +// So that here we should first normalize all 0.0 and -0.0 to be 0.0 +// See https://github.com/scala/bug/issues/11995 +for (i <- 0 until splits.length) { + if (splits(i) == -0.0) { +splits(i) = 0.0 + } +} val distinctSplits = splits.distinct if (splits.length != distinctSplits.length) { log.warn(s"Some quantiles were identical. 
Bucketing to ${distinctSplits.length - 1}" + diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala index 6f6ab26..682b87a 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala @@ -512,4 +512,22 @@ class QuantileDiscretizerSuite extends MLTest with DefaultReadWriteTest { assert(observedNumBuckets === numBuckets, "Observed number of buckets does not equal expected number of
[spark] branch master updated (ddbce4e -> b2300fc)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ddbce4e [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination … add b2300fc [SPARK-31676][ML] QuantileDiscretizer raise error parameter splits given invalid value (splits array includes -0.0 and 0.0) No new revisions were added by this update. Summary of changes: .../apache/spark/ml/feature/QuantileDiscretizer.scala | 12 .../spark/ml/feature/QuantileDiscretizerSuite.scala| 18 ++ 2 files changed, 30 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
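The normalization in b2300fc is easy to reason about outside the JVM. The following sketch (pure Python, illustrative names — Spark's implementation is the Scala shown above) reproduces the shape of `getDistinctSplits` after the fix: pin the two endpoints to ±infinity, normalize every -0.0 to 0.0, then deduplicate, so a -0.0/0.0 pair can no longer survive `distinct`. Note Python's own `hash(-0.0)` equals `hash(0.0)`, so only the normalization step, not the JVM bug itself, can be reproduced here.

```python
import math


def get_distinct_splits(splits):
    """Python sketch of QuantileDiscretizer.getDistinctSplits after the fix."""
    splits = list(splits)
    splits[0] = float("-inf")
    splits[-1] = float("inf")
    # Normalize -0.0 to 0.0 first; on the JVM 0.0 == -0.0 yet their
    # hashCodes differ, which is what made array.distinct misbehave.
    splits = [0.0 if (s == 0.0 and math.copysign(1.0, s) < 0) else s
              for s in splits]
    out, seen = [], set()
    for s in splits:  # order-preserving distinct
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out
```

With an input containing both zeros, e.g. `[-0.0, -0.5, -0.0, 0.0, 0.5, 0.0]`, the result keeps a single positive `0.0`.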
[spark] branch master updated: [MINOR][DOCS] Mention lack of RDD order preservation after deserialization
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 59d9099 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization 59d9099 is described below commit 59d90997a52f78450fefbc96beba1d731b3678a1 Author: Antonin Delpeuch AuthorDate: Tue May 12 08:27:43 2020 -0500 [MINOR][DOCS] Mention lack of RDD order preservation after deserialization ### What changes were proposed in this pull request? This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back ([SPARK-5300](https://issues.apache.org/jira/browse/SPARK-5300)). I added two sentences about this in the RDD Programming Guide. The issue was discussed on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html ### Why are the changes needed? Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order. This is unfortunately not the case at the moment and there is no agreed-upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this. ### Does this PR introduce _any_ user-facing change? Yes, two new sentences in the documentation. ### How was this patch tested? By checking that the documentation looks good. Closes #28465 from wetneb/SPARK-5300-docs. 
Authored-by: Antonin Delpeuch Signed-off-by: Sean Owen --- docs/rdd-programming-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md index ba99007..70bfefc 100644 --- a/docs/rdd-programming-guide.md +++ b/docs/rdd-programming-guide.md @@ -360,7 +360,7 @@ Some notes on reading files with Spark: * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system. -* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. +* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partiti [...] * The `textFile` method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
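A practical consequence of the documented caveat, sketched below in plain Python (no Spark required; the `part-*` file layout is a made-up stand-in for `saveAsTextFile` output): filesystem listing order carries no guarantee, so pipelines that depend on order should sort the part-file paths (or the records themselves) explicitly before consuming them.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Simulate saveAsTextFile output: one part file per partition,
    # created here in a deliberately scrambled order.
    for name in ["part-00002", "part-00000", "part-00001"]:
        with open(os.path.join(d, name), "w") as f:
            f.write(name + "\n")
    # os.listdir gives no ordering guarantee; sort explicitly when
    # downstream logic depends on partition order.
    parts = sorted(os.listdir(d))
    print(parts)  # ['part-00000', 'part-00001', 'part-00002']
```

Sorting paths only restores a deterministic file order; within-partition order still depends on how the data was written, which is exactly what the new doc sentences warn about.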
[spark] branch branch-2.4 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1f85cd7 [SPARK-31671][ML] Wrong error message in VectorAssembler 1f85cd7 is described below commit 1f85cd7504623b9b4e7957aab5856f72e981cbd9 Author: fan31415 AuthorDate: Mon May 11 18:23:23 2020 -0500 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However, the error message attached to this exception is inaccurate. I changed its content so it names the right columns. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduces difficulties when resolving the exception, because it is unclear which column requires a VectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? Add test in VectorAssemblerSuite. Closes #28487 from fan31415/SPARK-31671. Lead-authored-by: fan31415 Co-authored-by: yijiefan Signed-off-by: Sean Owen (cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853) Signed-off-by: Sean Owen --- .../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +- .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala index 9192e72..994681a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala @@ -228,7 +228,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] { getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns) case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException( s"""Can not infer column lengths with handleInvalid = "keep". 
Consider using VectorSizeHint - |to add metadata for columns: ${columns.mkString("[", ", ", "]")}.""" + |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}.""" .stripMargin.replaceAll("\n", " ")) case (_, _) => Map.empty } diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala index a4d388f..4957f6f 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala @@ -261,4 +261,15 @@ class VectorAssemblerSuite val output = vectorAssembler.transform(dfWithNullsAndNaNs) assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty))) } + + test("SPARK-31671: should give explicit error message when can not infer column lengths") { +val df = Seq( + (Vectors.dense(1.0), Vectors.dense(2.0)) +).toDF("n1", "n2") +val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df) +val assembler = new VectorAssembler() + .setInputCols(Array("n1", "n2")).setOutputCol(&quo
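The one-line fix (`columns` → `missingColumns`) is easier to see as a standalone sketch. This is pure Python with illustrative names — Spark's actual code is the Scala diff above — showing that the error should be built only from the columns whose vector length could not be inferred:

```python
def length_inference_error(input_cols, known_lengths):
    """Build a VectorAssembler-style error naming only the columns whose
    vector length could not be inferred (the corrected behaviour)."""
    missing = [c for c in input_cols if c not in known_lengths]
    return ('Can not infer column lengths with handleInvalid = "keep". '
            'Consider using VectorSizeHint to add metadata for columns: '
            '[' + ', '.join(missing) + '].')
```

With only `n1` hinted (`known_lengths = {"n1": 1}`), the message now names just `n2`, matching the expected output in the bug report.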
[spark] branch branch-3.0 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 7e226a2 [SPARK-31671][ML] Wrong error message in VectorAssembler 7e226a2 is described below commit 7e226a25efeaf083c95f04ee0d9c3a6e5b6d763d Author: fan31415 AuthorDate: Mon May 11 18:23:23 2020 -0500 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However the error message with this exception is not consistent. I change the content of this error message to make it work properly. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduce difficulties when I try to resolve this exception, for I do not know which column required vectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested? Add test in VectorAssemblerSuite. Closes #28487 from fan31415/SPARK-31671. Lead-authored-by: fan31415 Co-authored-by: yijiefan Signed-off-by: Sean Owen (cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853) Signed-off-by: Sean Owen --- .../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +- .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala index 3070012..7bc5e56 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala @@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] { getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns) case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException( s"""Can not infer column lengths with handleInvalid = "keep". 
Consider using VectorSizeHint - |to add metadata for columns: ${columns.mkString("[", ", ", "]")}.""" + |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}.""" .stripMargin.replaceAll("\n", " ")) case (_, _) => Map.empty } diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala index a4d388f..4957f6f 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala @@ -261,4 +261,15 @@ class VectorAssemblerSuite val output = vectorAssembler.transform(dfWithNullsAndNaNs) assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty))) } + + test("SPARK-31671: should give explicit error message when can not infer column lengths") { +val df = Seq( + (Vectors.dense(1.0), Vectors.dense(2.0)) +).toDF("n1", "n2") +val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df) +val assembler = new VectorAssembler() + .setInputCols(Array("n1", "n2")).setOutputCol(&quo
[spark] branch branch-2.4 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 1f85cd7 [SPARK-31671][ML] Wrong error message in VectorAssembler 1f85cd7 is described below commit 1f85cd7504623b9b4e7957aab5856f72e981cbd9 Author: fan31415 AuthorDate: Mon May 11 18:23:23 2020 -0500 [SPARK-31671][ML] Wrong error message in VectorAssembler ### What changes were proposed in this pull request? When input column lengths can not be inferred and handleInvalid = "keep", VectorAssembler will throw a runtime exception. However the error message with this exception is not consistent. I change the content of this error message to make it work properly. ### Why are the changes needed? This is a bug. Here is a simple example to reproduce it. ``` // create a df without vector size val df = Seq( (Vectors.dense(1.0), Vectors.dense(2.0)) ).toDF("n1", "n2") // only set vector size hint for n1 column val hintedDf = new VectorSizeHint() .setInputCol("n1") .setSize(1) .transform(df) // assemble n1, n2 val output = new VectorAssembler() .setInputCols(Array("n1", "n2")) .setOutputCol("features") .setHandleInvalid("keep") .transform(hintedDf) // because only n1 has vector size, the error message should tell us to set vector size for n2 too output.show() ``` Expected error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2]. ``` Actual error message: ``` Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2]. ``` This introduce difficulties when I try to resolve this exception, for I do not know which column required vectorSizeHint. This is especially troublesome when you have a large number of columns to deal with. ### Does this PR introduce _any_ user-facing change? No. 
### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415
Co-authored-by: yijiefan
Signed-off-by: Sean Owen
(cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 9192e72..994681a 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -228,7 +228,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 7e226a2  [SPARK-31671][ML] Wrong error message in VectorAssembler

7e226a2 is described below

commit 7e226a25efeaf083c95f04ee0d9c3a6e5b6d763d
Author: fan31415
AuthorDate: Mon May 11 18:23:23 2020 -0500

    [SPARK-31671][ML] Wrong error message in VectorAssembler

### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message accompanying this exception is inaccurate: it lists every input column instead of only the ones whose lengths are unknown. This change corrects the content of the error message.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
// create a df without vector size
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has a vector size, the error message should tell us to set a vector size for n2 only
output.show()
```

Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This introduces difficulties when trying to resolve the exception, since the message does not say which column actually requires a VectorSizeHint. This is especially troublesome when there is a large number of columns to deal with.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415
Co-authored-by: yijiefan
Signed-off-by: Sean Owen
(cherry picked from commit 64fb358a994d3fff651a742fa067c194b7455853)
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 3070012..7bc5e56 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (d7c3e9e -> 64fb358)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from d7c3e9e [SPARK-31456][CORE] Fix shutdown hook priority edge cases add 64fb358 [SPARK-31671][ML] Wrong error message in VectorAssembler No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ml/feature/VectorAssembler.scala | 2 +- .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala| 11 +++ 2 files changed, 12 insertions(+), 1 deletion(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31671][ML] Wrong error message in VectorAssembler
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 64fb358  [SPARK-31671][ML] Wrong error message in VectorAssembler

64fb358 is described below

commit 64fb358a994d3fff651a742fa067c194b7455853
Author: fan31415
AuthorDate: Mon May 11 18:23:23 2020 -0500

    [SPARK-31671][ML] Wrong error message in VectorAssembler

### What changes were proposed in this pull request?
When input column lengths cannot be inferred and handleInvalid = "keep", VectorAssembler throws a runtime exception. However, the error message accompanying this exception is inaccurate: it lists every input column instead of only the ones whose lengths are unknown. This change corrects the content of the error message.

### Why are the changes needed?
This is a bug. Here is a simple example to reproduce it.

```
// create a df without vector size
val df = Seq(
  (Vectors.dense(1.0), Vectors.dense(2.0))
).toDF("n1", "n2")

// only set vector size hint for n1 column
val hintedDf = new VectorSizeHint()
  .setInputCol("n1")
  .setSize(1)
  .transform(df)

// assemble n1, n2
val output = new VectorAssembler()
  .setInputCols(Array("n1", "n2"))
  .setOutputCol("features")
  .setHandleInvalid("keep")
  .transform(hintedDf)

// because only n1 has a vector size, the error message should tell us to set a vector size for n2 only
output.show()
```

Expected error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
```

Actual error message:
```
Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
```

This introduces difficulties when trying to resolve the exception, since the message does not say which column actually requires a VectorSizeHint. This is especially troublesome when there is a large number of columns to deal with.

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test in VectorAssemblerSuite.

Closes #28487 from fan31415/SPARK-31671.

Lead-authored-by: fan31415
Co-authored-by: yijiefan
Signed-off-by: Sean Owen
---
 .../scala/org/apache/spark/ml/feature/VectorAssembler.scala |  2 +-
 .../org/apache/spark/ml/feature/VectorAssemblerSuite.scala  | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
index 3070012..7bc5e56 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala
@@ -233,7 +233,7 @@ object VectorAssembler extends DefaultParamsReadable[VectorAssembler] {
         getVectorLengthsFromFirstRow(dataset.na.drop(missingColumns), missingColumns)
       case (true, VectorAssembler.KEEP_INVALID) => throw new RuntimeException(
         s"""Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint
-           |to add metadata for columns: ${columns.mkString("[", ", ", "]")}."""
+           |to add metadata for columns: ${missingColumns.mkString("[", ", ", "]")}."""
           .stripMargin.replaceAll("\n", " "))
       case (_, _) => Map.empty
     }
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
index a4d388f..4957f6f 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala
@@ -261,4 +261,15 @@ class VectorAssemblerSuite
     val output = vectorAssembler.transform(dfWithNullsAndNaNs)
     assert(output.select("a").limit(1).collect().head == Row(Vectors.sparse(0, Seq.empty)))
   }
+
+  test("SPARK-31671: should give explicit error message when can not infer column lengths") {
+    val df = Seq(
+      (Vectors.dense(1.0), Vectors.dense(2.0))
+    ).toDF("n1", "n2")
+    val hintedDf = new VectorSizeHint().setInputCol("n1").setSize(1).transform(df)
+    val assembler = new VectorAssembler()
+      .setInputCols(Array("n1", "n2")).setOutputCol("features")
+    assert(!intercept[RuntimeException](assembler.setHandleInvalid("keep").transform(hintedDf))
+      .getMessage.contains("n1"), "should only show no vector size columns' name")
+  }
 }
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
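[Editor's note] The one-line fix swaps `columns` for `missingColumns` in the interpolated message, so only the columns that actually lack size metadata are reported. As a rough illustration of the behaviour the new test checks, here is a small hypothetical Java sketch of the corrected message construction; the `keepInvalidMessage` helper is illustrative only (Spark's actual code is the Scala `mkString` call shown in the diff):

```java
import java.util.List;
import java.util.stream.Collectors;

public class Main {
    // Hypothetical helper mirroring the corrected interpolation: the message
    // names only the columns whose lengths could not be inferred.
    static String keepInvalidMessage(List<String> missingColumns) {
        return "Can not infer column lengths with handleInvalid = \"keep\". "
            + "Consider using VectorSizeHint to add metadata for columns: "
            + missingColumns.stream().collect(Collectors.joining(", ", "[", "]"))
            + ".";
    }

    public static void main(String[] args) {
        // With inputs n1 (sized via VectorSizeHint) and n2 (unsized), only n2
        // should appear in the message — before the fix, both did.
        System.out.println(keepInvalidMessage(List.of("n2")));
    }
}
```

This is exactly what the added suite test asserts: the exception message must not contain "n1", the column that already has a size hint.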
[spark] branch master updated (32a5398 -> 7a670b5)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 32a5398 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps add 7a670b5 [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest No new revisions were added by this update. Summary of changes: python/pyspark/ml/stat.py | 60 +-- 1 file changed, 48 insertions(+), 12 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated (7a670b5 -> 5a5af46)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 7a670b5 [SPARK-31667][ML][PYSPARK] Python side flatten the result dataframe of ANOVATest/ChisqTest/FValueTest add 5a5af46 [SPARK-31575][SQL] Synchronise global JVM security configuration modification No new revisions were added by this update. Summary of changes: .../jdbc/connection/DB2ConnectionProvider.scala| 2 +- .../connection/MariaDBConnectionProvider.scala | 2 +- .../connection/PostgresConnectionProvider.scala| 2 +- .../jdbc/connection/SecureConnectionProvider.scala | 9 - .../jdbc/connection/ConnectionProviderSuite.scala | 45 ++ 5 files changed, 56 insertions(+), 4 deletions(-) create mode 100644 sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/ConnectionProviderSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
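[Editor's note] The change above synchronises modification of JVM-global security configuration: the JDBC secure connection providers mutate shared JAAS state (`javax.security.auth.login.Configuration`), so unsynchronised concurrent updates can clobber each other. The sketch below illustrates the general pattern of serialising such updates behind a shared lock; the `installEntry` helper and lock object are hypothetical and only approximate the idea, not Spark's actual `SecureConnectionProvider` code:

```java
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

public class Main {
    // JAAS Configuration is process-wide mutable state; a shared lock makes
    // the read-modify-write of the global configuration atomic.
    private static final Object CONFIG_LOCK = new Object();

    static void installEntry(final String appName, final AppConfigurationEntry entry) {
        synchronized (CONFIG_LOCK) {
            Configuration.setConfiguration(new Configuration() {
                @Override
                public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                    // Serve our entry for the registered application name only.
                    return appName.equals(name)
                        ? new AppConfigurationEntry[] { entry }
                        : null;
                }
            });
        }
    }

    public static void main(String[] args) {
        AppConfigurationEntry entry = new AppConfigurationEntry(
            "com.sun.security.auth.module.Krb5LoginModule",
            AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
            Map.of());
        installEntry("example-app", entry);
        System.out.println(Configuration.getConfiguration()
            .getAppConfigurationEntry("example-app").length); // prints 1
    }
}
```

Without the lock, two providers installing entries concurrently could each capture the same base configuration and one update would be lost.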
[spark] branch master updated (a75dc80 -> 9f768fa)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from a75dc80 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference add 9f768fa [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/RandomDataGenerator.scala | 23 +++--- 1 file changed, 20 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch branch-3.0 updated: [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 6f7c719  [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps

6f7c719 is described below

commit 6f7c71947073f147bc35da196139d5ceb6fbdf45
Author: Max Gekk
AuthorDate: Sun May 10 14:22:12 2020 -0500

    [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps

### What changes were proposed in this pull request?
Shift dates that do not exist in the Proleptic Gregorian calendar by 1 day. The reason for that is that `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some of them don't exist in the Proleptic Gregorian calendar — for example 1000-02-29, because 1000 is not a leap year in the Proleptic Gregorian calendar.

### Why are the changes needed?
This makes RowEncoderSuite much more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running RowEncoderSuite and setting a non-existing date manually:
```scala
val date = new java.sql.Date(1000 - 1900, 1, 29)
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
```

Closes #28486 from MaxGekk/fix-RowEncoderSuite.
Authored-by: Max Gekk
Signed-off-by: Sean Owen
(cherry picked from commit 9f768fa9916dec3cc695e3f28ec77148d81d335f)
Signed-off-by: Sean Owen
---
 .../org/apache/spark/sql/RandomDataGenerator.scala | 23 +++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
index a7c20c3..5a4d23d 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/RandomDataGenerator.scala
@@ -18,9 +18,10 @@
 package org.apache.spark.sql
 
 import java.math.MathContext
+import java.sql.{Date, Timestamp}
 
 import scala.collection.mutable
-import scala.util.Random
+import scala.util.{Random, Try}
 
 import org.apache.spark.sql.catalyst.CatalystTypeConverters
 import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY
@@ -172,7 +173,15 @@ object RandomDataGenerator {
           // January 1, 1970, 00:00:00 GMT for "9999-12-31 23:59:59.999999".
           milliseconds = rand.nextLong() % 253402329599999L
         }
-        DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt)
+        val date = DateTimeUtils.toJavaDate((milliseconds / MILLIS_PER_DAY).toInt)
+        // The generated `date` is based on the hybrid calendar Julian + Gregorian since
+        // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used
+        // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `date` to
+        // a local date in Proleptic Gregorian calendar to satisfy this requirement.
+        // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar.
+        // As the consequence of that, 29 February of such years might not exist in Proleptic
+        // Gregorian calendar. When this happens, we shift the date by one day.
+        Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
       }
       Some(generator)
     case TimestampType =>
@@ -188,7 +197,15 @@ object RandomDataGenerator {
           milliseconds = rand.nextLong() % 253402329599999L
         }
         // DateTimeUtils.toJavaTimestamp takes microsecond.
-        DateTimeUtils.toJavaTimestamp(milliseconds * 1000)
+        val ts = DateTimeUtils.toJavaTimestamp(milliseconds * 1000)
+        // The generated `ts` is based on the hybrid calendar Julian + Gregorian since
+        // 1582-10-15 but it should be valid in Proleptic Gregorian calendar too which is used
+        // by Spark SQL since version 3.0 (see SPARK-26651). We try to convert `ts` to
+        // a local timestamp in Proleptic Gregorian calendar to satisfy this requirement.
+        // Some years are leap years in Julian calendar but not in Proleptic Gregorian calendar.
+        // As the consequence of that, 29 February of such years might not exist in Proleptic
+        // Gregorian calendar. When this happens, we shift the timestamp `ts` by one day.
+        Try { ts.toLocalDateTime; ts }.getOrElse(new Timestamp(ts.getTime + MILLIS_PER_DAY))
       }
       Some(generator)
     case CalendarIntervalType => Some(() => {
- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
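[Editor's note] The shift-by-one-day trick works because `java.sql.Date` still uses the hybrid Julian + Gregorian calendar while `java.time.LocalDate` is Proleptic Gregorian, so a hybrid-calendar day like 1000-02-29 has no `LocalDate` counterpart. A small standalone Java demonstration of the same fallback pattern (assuming default JDK calendar behaviour; this is a sketch, not Spark code):

```java
import java.sql.Date;
import java.time.DateTimeException;

public class Main {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    public static void main(String[] args) {
        // The deprecated constructor interprets its arguments in the hybrid
        // Julian + Gregorian calendar, where year 1000 IS a leap year, so
        // 1000-02-29 is representable.
        @SuppressWarnings("deprecation")
        Date date = new Date(1000 - 1900, 1, 29);

        // LocalDate uses the Proleptic Gregorian calendar, where year 1000 is
        // NOT a leap year, so the conversion fails with DateTimeException.
        Date valid;
        try {
            date.toLocalDate();
            valid = date;
        } catch (DateTimeException e) {
            // Shift by one day, as the patch does, to land on a day that
            // exists in both calendars.
            valid = new Date(date.getTime() + MILLIS_PER_DAY);
        }
        System.out.println(valid.toLocalDate()); // prints 1000-03-01
    }
}
```

This mirrors the `Try { date.toLocalDate; date }.getOrElse(...)` fallback in the patch: a failed proleptic-Gregorian conversion is the signal to nudge the value forward one day.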
[spark] branch master updated (ce63bef -> a75dc80)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from ce63bef [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns add a75dc80 [SPARK-31636][SQL][DOCS] Remove HTML syntax in SQL reference No new revisions were added by this update. Summary of changes: docs/_data/menu-sql.yaml | 20 +- docs/sql-ref-ansi-compliance.md| 18 +- docs/sql-ref-datatypes.md | 4 +- docs/sql-ref-functions-builtin.md | 2 +- docs/sql-ref-functions-udf-aggregate.md| 101 docs/sql-ref-functions-udf-hive.md | 12 +- docs/sql-ref-functions-udf-scalar.md | 28 +- docs/sql-ref-identifier.md | 37 ++- docs/sql-ref-literals.md | 282 + docs/sql-ref-null-semantics.md | 44 ++-- docs/sql-ref-syntax-aux-analyze-table.md | 64 ++--- docs/sql-ref-syntax-aux-cache-cache-table.md | 98 +++ docs/sql-ref-syntax-aux-cache-clear-cache.md | 16 +- docs/sql-ref-syntax-aux-cache-refresh.md | 24 +- docs/sql-ref-syntax-aux-cache-uncache-table.md | 31 +-- docs/sql-ref-syntax-aux-conf-mgmt-reset.md | 10 +- docs/sql-ref-syntax-aux-conf-mgmt-set.md | 31 +-- docs/sql-ref-syntax-aux-describe-database.md | 21 +- docs/sql-ref-syntax-aux-describe-function.md | 30 +-- docs/sql-ref-syntax-aux-describe-query.md | 44 ++-- docs/sql-ref-syntax-aux-describe-table.md | 62 ++--- docs/sql-ref-syntax-aux-refresh-table.md | 31 +-- docs/sql-ref-syntax-aux-resource-mgmt-add-file.md | 21 +- docs/sql-ref-syntax-aux-resource-mgmt-add-jar.md | 21 +- docs/sql-ref-syntax-aux-resource-mgmt-list-file.md | 14 +- docs/sql-ref-syntax-aux-resource-mgmt-list-jar.md | 14 +- docs/sql-ref-syntax-aux-show-columns.md| 2 +- docs/sql-ref-syntax-aux-show-create-table.md | 27 +- docs/sql-ref-syntax-aux-show-databases.md | 32 +-- docs/sql-ref-syntax-aux-show-functions.md | 60 ++--- docs/sql-ref-syntax-aux-show-partitions.md | 47 ++-- docs/sql-ref-syntax-aux-show-table.md | 60 ++--- 
docs/sql-ref-syntax-aux-show-tables.md | 41 ++- docs/sql-ref-syntax-aux-show-tblproperties.md | 51 ++-- docs/sql-ref-syntax-aux-show-views.md | 45 ++-- docs/sql-ref-syntax-aux-show.md| 4 +- docs/sql-ref-syntax-ddl-alter-database.md | 17 +- docs/sql-ref-syntax-ddl-alter-table.md | 256 --- docs/sql-ref-syntax-ddl-alter-view.md | 124 - docs/sql-ref-syntax-ddl-create-database.md | 39 +-- docs/sql-ref-syntax-ddl-create-function.md | 85 +++ docs/sql-ref-syntax-ddl-create-table-datasource.md | 100 docs/sql-ref-syntax-ddl-create-table-hiveformat.md | 99 docs/sql-ref-syntax-ddl-create-table-like.md | 73 +++--- docs/sql-ref-syntax-ddl-create-table.md| 10 +- docs/sql-ref-syntax-ddl-create-view.md | 82 +++--- docs/sql-ref-syntax-ddl-drop-database.md | 42 ++- docs/sql-ref-syntax-ddl-drop-function.md | 55 ++-- docs/sql-ref-syntax-ddl-drop-table.md | 45 ++-- docs/sql-ref-syntax-ddl-drop-view.md | 49 ++-- docs/sql-ref-syntax-ddl-repair-table.md| 25 +- docs/sql-ref-syntax-ddl-truncate-table.md | 43 ++-- docs/sql-ref-syntax-dml-insert-into.md | 90 +++ ...f-syntax-dml-insert-overwrite-directory-hive.md | 75 +++--- ...ql-ref-syntax-dml-insert-overwrite-directory.md | 74 +++--- docs/sql-ref-syntax-dml-insert-overwrite-table.md | 87 +++ docs/sql-ref-syntax-dml-insert.md | 8 +- docs/sql-ref-syntax-dml-load.md| 67 ++--- docs/sql-ref-syntax-dml.md | 4 +- docs/sql-ref-syntax-qry-explain.md | 58 ++--- docs/sql-ref-syntax-qry-sampling.md| 20 +- docs/sql-ref-syntax-qry-select-clusterby.md| 33 ++- docs/sql-ref-syntax-qry-select-cte.md | 35 ++- docs/sql-ref-syntax-qry-select-distribute-by.md| 33 ++- docs/sql-ref-syntax-qry-select-groupby.md | 261 ++- docs/sql-ref-syntax-qry-select-having.md | 54 ++-- docs/sql-ref-syntax-qry-select-hints.md| 56 ++-- docs/sql-ref-syntax-qry-select-inline-table.md | 35 +-- docs/sql-ref-syntax-qry-select-join.md | 185 ++ docs/sql-ref-syntax-qry-select-like.md | 51 ++-- docs/sql-ref-syntax-qry-select-limit.md| 41 ++- docs/sql-ref-syntax-qry-select-orderby.md
[spark] branch master updated (b16ea8e -> 09ece50)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b16ea8e [SPARK-31650][SQL] Fix wrong UI in case of AdaptiveSparkPlanExec has unmanaged subqueries add 09ece50 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark No new revisions were added by this update. Summary of changes: python/pyspark/ml/feature.py | 142 +++ 1 file changed, 142 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 09ece50 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark 09ece50 is described below commit 09ece50799222d577009a2bbd480304d1ae1e14e Author: Huaxin Gao AuthorDate: Wed May 6 09:11:03 2020 -0500 [SPARK-31609][ML][PYSPARK] Add VarianceThresholdSelector to PySpark ### What changes were proposed in this pull request? Add VarianceThresholdSelector to PySpark ### Why are the changes needed? parity between Scala and Python ### Does this PR introduce any user-facing change? Yes. VarianceThresholdSelector is added to PySpark ### How was this patch tested? new doctest Closes #28409 from huaxingao/variance_py. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- python/pyspark/ml/feature.py | 142 +++ 1 file changed, 142 insertions(+) diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 6df2f74..7acf8ce 100755 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -57,6 +57,7 @@ __all__ = ['Binarizer', 'StopWordsRemover', 'StringIndexer', 'StringIndexerModel', 'Tokenizer', + 'VarianceThresholdSelector', 'VarianceThresholdSelectorModel', 'VectorAssembler', 'VectorIndexer', 'VectorIndexerModel', 'VectorSizeHint', @@ -5381,6 +5382,147 @@ class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReada return self._set(handleInvalid=value) +class _VarianceThresholdSelectorParams(HasFeaturesCol, HasOutputCol): +""" +Params for :py:class:`VarianceThresholdSelector` and +:py:class:`VarianceThresholdSelectorrModel`. + +.. versionadded:: 3.1.0 +""" + +varianceThreshold = Param(Params._dummy(), "varianceThreshold", + "Param for variance threshold. Features with a variance not " + + "greater than this threshold will be removed. 
The default value " + + "is 0.0.", typeConverter=TypeConverters.toFloat) + +@since("3.1.0") +def getVarianceThreshold(self): +""" +Gets the value of varianceThreshold or its default value. +""" +return self.getOrDefault(self.varianceThreshold) + + +@inherit_doc +class VarianceThresholdSelector(JavaEstimator, _VarianceThresholdSelectorParams, JavaMLReadable, +JavaMLWritable): +""" +Feature selector that removes all low-variance features. Features with a +variance not greater than the threshold will be removed. The default is to keep +all features with non-zero variance, i.e. remove the features that have the +same value in all samples. + +>>> from pyspark.ml.linalg import Vectors +>>> df = spark.createDataFrame( +...[(Vectors.dense([6.0, 7.0, 0.0, 7.0, 6.0, 0.0]),), +... (Vectors.dense([0.0, 9.0, 6.0, 0.0, 5.0, 9.0]),), +... (Vectors.dense([0.0, 9.0, 3.0, 0.0, 5.0, 5.0]),), +... (Vectors.dense([0.0, 9.0, 8.0, 5.0, 6.0, 4.0]),), +... (Vectors.dense([8.0, 9.0, 6.0, 5.0, 4.0, 4.0]),), +... (Vectors.dense([8.0, 9.0, 6.0, 0.0, 0.0, 0.0]),)], +...["features"]) +>>> selector = VarianceThresholdSelector(varianceThreshold=8.2, outputCol="selectedFeatures") +>>> model = selector.fit(df) +>>> model.getFeaturesCol() +'features' +>>> model.setFeaturesCol("features") +VarianceThresholdSelectorModel... +>>> model.transform(df).head().selectedFeatures +DenseVector([6.0, 7.0, 0.0]) +>>> model.selectedFeatures +[0, 3, 5] +>>> varianceThresholdSelectorPath = temp_path + "/variance-threshold-selector" +>>> selector.save(varianceThresholdSelectorPath) +>>> loadedSelector = VarianceThresholdSelector.load(varianceThresholdSelectorPath) +>>> loadedSelector.getVarianceThreshold() == selector.getVarianceThreshold() +True +>>> modelPath = temp_path + "/variance-threshold-selector-model" +>>> model.save(modelPath) +>>> loadedModel = VarianceThresholdSelectorModel.load(modelPath) +>>> loadedModel.selectedFeatures == model.selectedFeatures +True + +.. 
versionadded:: 3.1.0 +""" + +@keyword_only +def __init__(self, featuresCol="features", outputCol=None, varianceThreshold=0.0): +
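The selection rule in the doctest above can be reproduced without Spark. Below is a minimal pure-Python sketch of the same logic — per-column variance, keep columns whose variance exceeds the threshold — using the data and threshold from the doctest (the helper name is ours, not part of the PySpark API; for this data the biased and unbiased variance estimators select the same columns):

```python
# Minimal sketch of VarianceThresholdSelector's selection rule (no Spark needed).
# Columns whose variance is not greater than the threshold are dropped.

def select_by_variance(rows, threshold):
    """Return indices of feature columns whose sample variance exceeds `threshold`."""
    n = len(rows)
    cols = list(zip(*rows))  # transpose: one tuple per feature column
    kept = []
    for j, col in enumerate(cols):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)  # unbiased (sample) variance
        if var > threshold:
            kept.append(j)
    return kept

data = [
    [6.0, 7.0, 0.0, 7.0, 6.0, 0.0],
    [0.0, 9.0, 6.0, 0.0, 5.0, 9.0],
    [0.0, 9.0, 3.0, 0.0, 5.0, 5.0],
    [0.0, 9.0, 8.0, 5.0, 6.0, 4.0],
    [8.0, 9.0, 6.0, 5.0, 4.0, 4.0],
    [8.0, 9.0, 6.0, 0.0, 0.0, 0.0],
]
print(select_by_variance(data, 8.2))  # matches model.selectedFeatures: [0, 3, 5]
```

With the doctest's threshold of 8.2, columns 1, 2, and 4 have variance at or below the threshold and are dropped, leaving indices [0, 3, 5] — the same `selectedFeatures` the doctest asserts.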
[spark] branch master updated (5052d95 -> 701deac)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 5052d95 [SPARK-31030][DOCS][FOLLOWUP] Replace HTML Table by Markdown Table add 701deac [SPARK-31603][ML] AFT uses common functions in RDDLossFunction No new revisions were added by this update. Summary of changes: .../spark/ml/optim/aggregator/AFTAggregator.scala | 162 +++ .../aggregator/DifferentiableLossAggregator.scala | 9 +- .../ml/regression/AFTSurvivalRegression.scala | 228 + 3 files changed, 173 insertions(+), 226 deletions(-) create mode 100644 mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31603][ML] AFT uses common functions in RDDLossFunction
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 701deac [SPARK-31603][ML] AFT uses common functions in RDDLossFunction 701deac is described below commit 701deac88d09690ddf9d28b9c79814aecfd3277d Author: zhengruifeng AuthorDate: Tue May 5 08:35:20 2020 -0500 [SPARK-31603][ML] AFT uses common functions in RDDLossFunction ### What changes were proposed in this pull request? 1, make AFT reuse common functions in `ml.optim`, rather than making its own impl. ### Why are the changes needed? The logic in optimizing AFT is quite similar to other algorithms like other algs based on `RDDLossFunction`, We should reuse the common functions. ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #28404 from zhengruifeng/mv_aft_optim. Authored-by: zhengruifeng Signed-off-by: Sean Owen --- .../spark/ml/optim/aggregator/AFTAggregator.scala | 162 +++ .../aggregator/DifferentiableLossAggregator.scala | 9 +- .../ml/regression/AFTSurvivalRegression.scala | 228 + 3 files changed, 173 insertions(+), 226 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala new file mode 100644 index 000..6482c61 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/AFTAggregator.scala @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.optim.aggregator + +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg._ +import org.apache.spark.ml.regression.AFTPoint + +/** + * AFTAggregator computes the gradient and loss for a AFT loss function, + * as used in AFT survival regression for samples in sparse or dense vector in an online fashion. + * + * The loss function and likelihood function under the AFT model based on: + * Lawless, J. F., Statistical Models and Methods for Lifetime Data, + * New York: John Wiley & Sons, Inc. 2003. + * + * Two AFTAggregator can be merged together to have a summary of loss and gradient of + * the corresponding joint dataset. + * + * Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of subjects i = 1,..,n, + * with possible right-censoring, the likelihood function under the AFT model is given as + * + * + *$$ + *L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0} + * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0} + *(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}} + *$$ + * + * + * Where $\delta_{i}$ is the indicator of the event has occurred i.e. uncensored or not. 
+ * Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function + * assumes the form + * + * + *$$ + *\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+ + * \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}] + *$$ + * + * Where $S_{0}(\epsilon_{i})$ is the baseline survivor function, + * and $f_{0}(\epsilon_{i})$ is corresponding density function. + * + * The most commonly used log-linear survival regression method is based on the Weibull + * distribution of the survival time. The Weibull distribution for lifetime corresponding + * to extreme value distribution for log of the lifetime, + * and the $S_{0}(\epsilon)$ function is + * + * + *$$ + *S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}}) + *$$ + * + * + * and the $f_{0}(\epsilon_{i})$ function is + * + * + *$$ + *f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}}) + *$$ + * + * + * The log-likelihood function for Weibull distribution of lifetime is + * + * + *$$ + *\iota(\beta,\sigma)= + * -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}] + *$$ + * + * + * Due to minimizing the negative l
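The Weibull negative log-likelihood that the aggregator minimizes can be evaluated directly from the formulas in the scaladoc above. The sketch below is our own illustration (not the AFTAggregator code): it computes $-\iota(\beta,\sigma)=\sum_i[\delta_i\log\sigma-\delta_i\epsilon_i+e^{\epsilon_i}]$ with $\epsilon_i=(\log t_i - x_i'\beta)/\sigma$, parameterized by $\log\sigma$ as the commit's optimizer does so that $\sigma$ stays positive:

```python
import math

def aft_weibull_neg_log_likelihood(data, beta, intercept, log_sigma):
    """Negative log-likelihood of the Weibull AFT model.

    data: list of (t, delta, x) with lifetime t > 0, censoring indicator
    delta (1.0 = event observed, 0.0 = right-censored) and feature list x.
    """
    sigma = math.exp(log_sigma)  # optimizing log(sigma) keeps sigma > 0
    total = 0.0
    for t, delta, x in data:
        margin = intercept + sum(b * v for b, v in zip(beta, x))
        eps = (math.log(t) - margin) / sigma
        # -iota = sum_i [ delta_i*log(sigma) - delta_i*eps_i + exp(eps_i) ]
        total += delta * log_sigma - delta * eps + math.exp(eps)
    return total

# Tiny example: two subjects, the second one right-censored.
data = [(1.2, 1.0, [1.5, -0.3]), (4.0, 0.0, [0.5, 0.9])]
loss = aft_weibull_neg_log_likelihood(data, beta=[0.1, -0.2], intercept=0.3, log_sigma=0.0)
```

For a censored subject ($\delta=0$) only the survivor term $e^{\epsilon}$ contributes, exactly as $\log S_0(\epsilon)=-e^{\epsilon}$ in the derivation above.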
[spark] branch master updated: [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 348fd53 [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue 348fd53 is described below commit 348fd53214ccc476bee37e3ddd6b075a53886104 Author: Qianyang Yu AuthorDate: Fri May 1 09:16:08 2020 -0500 [SPARK-31307][ML][EXAMPLES] Add examples for ml.fvalue ### What changes were proposed in this pull request? Add FValue example for ml.stat.FValueTest in python/java/scala ### Why are the changes needed? Improve ML example ### Does this PR introduce any user-facing change? No ### How was this patch tested? manually run the example Closes #28400 from kevinyu98/spark-26111-fvalue-examples. Authored-by: Qianyang Yu Signed-off-by: Sean Owen --- .../spark/examples/ml/JavaFValueTestExample.java | 75 ++ examples/src/main/python/ml/fvalue_test_example.py | 52 +++ .../spark/examples/ml/FVlaueTestExample.scala | 63 ++ 3 files changed, 190 insertions(+) diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java new file mode 100644 index 000..11861ac --- /dev/null +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.ml; + +import org.apache.spark.sql.SparkSession; + +// $example on$ +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.ml.linalg.Vectors; +import org.apache.spark.ml.linalg.VectorUDT; +import org.apache.spark.ml.stat.FValueTest; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.types.*; +// $example off$ + +/** + * An example for FValue testing. + * Run with + * + * bin/run-example ml.JavaFValueTestExample + * + */ +public class JavaFValueTestExample { + + public static void main(String[] args) { +SparkSession spark = SparkSession + .builder() + .appName("JavaFValueTestExample") + .getOrCreate(); + +// $example on$ +List data = Arrays.asList( + RowFactory.create(4.6, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)), + RowFactory.create(6.6, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)), + RowFactory.create(5.1, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)), + RowFactory.create(7.6, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), + RowFactory.create(9.0, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), + RowFactory.create(9.0, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)) +); + +StructType schema = new StructType(new StructField[]{ + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("features", new VectorUDT(), false, Metadata.empty()), +}); + +Dataset df = spark.createDataFrame(data, schema); +Row r = FValueTest.test(df, "features", "label").head(); 
+System.out.println("pValues: " + r.get(0).toString()); +System.out.println("degreesOfFreedom: " + r.getList(1).toString()); +System.out.println("fvalue: " + r.get(2).toString()); + +// $example off$ + +spark.stop(); + } +} diff --git a/examples/src/main/python/ml/fvalue_test_example.py b/examples/src/main/python/ml/fvalue_test_example.py new file mode 100644 index 000..4a97bcd --- /dev/null +++ b/examples/src/main/python/ml/fvalue_test_example.py @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not
[spark] branch branch-3.0 updated: [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new f8ff9c5 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started f8ff9c5 is described below commit f8ff9c5eff55ba7003a51f9ac91786d16764f4c9 Author: Huaxin Gao AuthorDate: Tue Apr 28 11:17:45 2020 -0500 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started ### What changes were proposed in this pull request? Add a paragraph for scalar function in sql getting started ### Why are the changes needed? To make 3.0 doc complete. ### Does this PR introduce any user-facing change? before: https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png;> after: https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png;> https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png;> ### How was this patch tested? Closes #28290 from huaxingao/scalar. Authored-by: Huaxin Gao Signed-off-by: Sean Owen (cherry picked from commit dcc09022f1b8ecedf6b64bf35ce5d83500211351) Signed-off-by: Sean Owen --- docs/sql-getting-started.md | 13 + docs/sql-ref-functions.md | 7 +-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md index dab34af..5a6f182 100644 --- a/docs/sql-getting-started.md +++ b/docs/sql-getting-started.md @@ -347,16 +347,13 @@ For example: ## Scalar Functions -(to be filled soon) -## Aggregations +Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). 
It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html). -The [built-in DataFrames functions](api/scala/org/apache/spark/sql/functions$.html) provide common -aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. -While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in -[Scala](api/scala/org/apache/spark/sql/expressions/scalalang/typed$.html) and -[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets. -Moreover, users are not limited to the predefined aggregate functions and can create their own. For more details +## Aggregate Functions + +Aggregate functions are functions that return a single value on a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. +Users are not limited to the predefined aggregate functions and can create their own. For more details about user defined aggregate functions, please refer to the documentation of [User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html). diff --git a/docs/sql-ref-functions.md b/docs/sql-ref-functions.md index 6368fb7..7493b8b 100644 --- a/docs/sql-ref-functions.md +++ b/docs/sql-ref-functions.md @@ -27,13 +27,16 @@ Built-in functions are commonly used routines that Spark SQL predefines and a co Spark SQL has some categories of frequently-used built-in functions for aggregtion, arrays/maps, date/timestamp, and JSON data. This subsection presents the usages and descriptions of these functions. 
- * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) - * [Window Functions](sql-ref-functions-builtin.html#window-functions) + Scalar Functions * [Array Functions](sql-ref-functions-builtin.html#array-functions) * [Map Functions](sql-ref-functions-builtin.html#map-functions) * [Date and Timestamp Functions](sql-ref-functions-builtin.html#date-and-timestamp-functions) * [JSON Functions](sql-ref-functions-builtin.html#json-functions) + Aggregate-like Functions + * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) + * [Window Functions](sql-ref-functions-builtin.html#window-functions) + ### UDFs (User-Defined Functions) User-Defined Functions (UDFs) are a feature of Spark SQL that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task. To use UDFs in Spark SQL, users must first define the function, then register the function with Spark, and finally call the registered function. The User-Defined Functions can act on a single row or act on multiple rows at once. Spark SQL also supports integration of exis
[spark] branch master updated: [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dcc0902 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started dcc0902 is described below commit dcc09022f1b8ecedf6b64bf35ce5d83500211351 Author: Huaxin Gao AuthorDate: Tue Apr 28 11:17:45 2020 -0500 [SPARK-29458][SQL][DOCS] Add a paragraph for scalar function in sql getting started ### What changes were proposed in this pull request? Add a paragraph for scalar function in sql getting started ### Why are the changes needed? To make 3.0 doc complete. ### Does this PR introduce any user-facing change? before: https://user-images.githubusercontent.com/13592258/79943182-16d1fd00-841d-11ea-9744-9cdd58d83f81.png;> after: https://user-images.githubusercontent.com/13592258/80068256-26704500-84f4-11ea-9845-c835927c027e.png;> https://user-images.githubusercontent.com/13592258/80165100-82d47280-858f-11ea-8c84-1ef702cc1bff.png;> ### How was this patch tested? Closes #28290 from huaxingao/scalar. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- docs/sql-getting-started.md | 13 + docs/sql-ref-functions.md | 7 +-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/sql-getting-started.md b/docs/sql-getting-started.md index dab34af..5a6f182 100644 --- a/docs/sql-getting-started.md +++ b/docs/sql-getting-started.md @@ -347,16 +347,13 @@ For example: ## Scalar Functions -(to be filled soon) -## Aggregations +Scalar functions are functions that return a single value per row, as opposed to aggregation functions, which return a value for a group of rows. Spark SQL supports a variety of [Built-in Scalar Functions](sql-ref-functions.html#scalar-functions). It also supports [User Defined Scalar Functions](sql-ref-functions-udf-scalar.html). 
-The [built-in DataFrames functions](api/scala/org/apache/spark/sql/functions$.html) provide common -aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. -While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in -[Scala](api/scala/org/apache/spark/sql/expressions/scalalang/typed$.html) and -[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets. -Moreover, users are not limited to the predefined aggregate functions and can create their own. For more details +## Aggregate Functions + +Aggregate functions are functions that return a single value on a group of rows. The [Built-in Aggregation Functions](sql-ref-functions-builtin.html#aggregate-functions) provide common aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. +Users are not limited to the predefined aggregate functions and can create their own. For more details about user defined aggregate functions, please refer to the documentation of [User Defined Aggregate Functions](sql-ref-functions-udf-aggregate.html). diff --git a/docs/sql-ref-functions.md b/docs/sql-ref-functions.md index 6368fb7..7493b8b 100644 --- a/docs/sql-ref-functions.md +++ b/docs/sql-ref-functions.md @@ -27,13 +27,16 @@ Built-in functions are commonly used routines that Spark SQL predefines and a co Spark SQL has some categories of frequently-used built-in functions for aggregtion, arrays/maps, date/timestamp, and JSON data. This subsection presents the usages and descriptions of these functions. 
- * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) - * [Window Functions](sql-ref-functions-builtin.html#window-functions) + Scalar Functions * [Array Functions](sql-ref-functions-builtin.html#array-functions) * [Map Functions](sql-ref-functions-builtin.html#map-functions) * [Date and Timestamp Functions](sql-ref-functions-builtin.html#date-and-timestamp-functions) * [JSON Functions](sql-ref-functions-builtin.html#json-functions) + Aggregate-like Functions + * [Aggregate Functions](sql-ref-functions-builtin.html#aggregate-functions) + * [Window Functions](sql-ref-functions-builtin.html#window-functions) + ### UDFs (User-Defined Functions) User-Defined Functions (UDFs) are a feature of Spark SQL that allows users to define their own functions when the system's built-in functions are not enough to perform the desired task. To use UDFs in Spark SQL, users must first define the function, then register the function with Spark, and finally call the registered function. The User-Defined Functions can act on a single row or act on multiple rows at once. Spark SQL also supports integration of existing Hive implementation
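The distinction the docs change draws — a scalar function returns one value per input row, an aggregate function returns one value per group of rows — can be illustrated outside Spark as well. A plain-Python analogy (not Spark API; the row data is invented for illustration):

```python
from collections import defaultdict

rows = [("a", 1), ("b", 4), ("a", 3), ("b", 2)]

# Scalar function: applied independently to each row -> one result per row.
upper_keys = [(k.upper(), v) for k, v in rows]  # 4 rows in, 4 rows out

# Aggregate function: applied to a group of rows -> one result per group.
groups = defaultdict(list)
for k, v in rows:
    groups[k].append(v)
averages = {k: sum(vs) / len(vs) for k, vs in groups.items()}  # 2 groups, 2 results

print(upper_keys)  # [('A', 1), ('B', 4), ('A', 3), ('B', 2)]
print(averages)    # {'a': 2.0, 'b': 3.0}
```

In Spark SQL terms, the first corresponds to something like `upper(col)` in a `SELECT` list, the second to `avg(col)` under a `GROUP BY`.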
[spark] branch branch-3.0 updated (3b30066 -> 6f10c8a)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git. from 3b30066 [SPARK-31529][SQL][3.0] Remove extra whitespaces in formatted explain add 6f10c8a [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page No new revisions were added by this update. Summary of changes: docs/sql-ref.md | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7735db2a2 [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page 7735db2a2 is described below commit 7735db2a273edf208ae50e88926c9f7a77e5dbac Author: Huaxin Gao AuthorDate: Mon Apr 27 09:45:00 2020 -0500 [SPARK-31569][SQL][DOCS] Add links to subsections in SQL Reference main page ### What changes were proposed in this pull request? Add links to subsections in SQL Reference main page ### Why are the changes needed? Make SQL Reference complete ### Does this PR introduce any user-facing change? Yes before: https://user-images.githubusercontent.com/13592258/80338238-a9551080-8810-11ea-8ae8-d6707fde2cac.png;> after: https://user-images.githubusercontent.com/13592258/80338241-ac500100-8810-11ea-8518-95c4f8c0a2eb.png;> ### How was this patch tested? Manually build and check. Closes #28360 from huaxingao/sql-ref. Authored-by: Huaxin Gao Signed-off-by: Sean Owen --- docs/sql-ref.md | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/docs/sql-ref.md b/docs/sql-ref.md index 6c57b0d6..db51fe1 100644 --- a/docs/sql-ref.md +++ b/docs/sql-ref.md @@ -1,7 +1,7 @@ --- layout: global -title: Reference -displayTitle: Reference +title: SQL Reference +displayTitle: SQL Reference license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -19,7 +19,21 @@ license: | limitations under the License. --- -Spark SQL is Apache Spark's module for working with structured data. -This guide is a reference for Structured Query Language (SQL) for Apache -Spark. This document describes the SQL constructs supported by Spark in detail -along with usage examples when applicable. 
+Spark SQL is Apache Spark's module for working with structured data. This guide is a reference for Structured Query Language (SQL) and includes syntax, semantics, keywords, and examples for common SQL usage. It contains information for the following topics: + + * [Data Types](sql-ref-datatypes.html) + * [Identifiers](sql-ref-identifier.html) + * [Literals](sql-ref-literals.html) + * [Null Semanitics](sql-ref-null-semantics.html) + * [ANSI Compliance](sql-ref-ansi-compliance.html) + * [SQL Syntax](sql-ref-syntax.html) + * [DDL Statements](sql-ref-syntax-ddl.html) + * [DML Statements](sql-ref-syntax-ddl.html) + * [Data Retrieval Statements](sql-ref-syntax-qry.html) + * [Auxiliary Statements](sql-ref-syntax-aux.html) + * [Functions](sql-ref-functions.html) + * [Built-in Functions](sql-ref-functions-builtin.html) + * [Scalar User-Defined Functions (UDFs)](sql-ref-functions-udf-scalar.html) + * [User-Defined Aggregate Functions (UDAFs)](sql-ref-functions-udf-aggregate.html) + * [Integration with Hive UDFs/UDAFs/UDTFs](sql-ref-functions-udf-hive.html) + * [Datetime Pattern](sql-ref-datetime-pattern.html) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-31400][ML] The catalogString doesn't distinguish Vectors in ml and mllib
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fe07b21 [SPARK-31400][ML] The catalogString doesn't distinguish Vectors in ml and mllib fe07b21 is described below commit fe07b21b8ab60def6c4451c661e4dd46a4d48b5a Author: TJX2014 AuthorDate: Sun Apr 26 11:35:44 2020 -0500 [SPARK-31400][ML] The catalogString doesn't distinguish Vectors in ml and mllib What changes were proposed in this pull request? 1.Add class info output in org.apache.spark.ml.util.SchemaUtils#checkColumnType to distinct Vectors in ml and mllib 2.Add unit test Why are the changes needed? the catalogString doesn't distinguish Vectors in ml and mllib when mllib vector misused in ml https://issues.apache.org/jira/browse/SPARK-31400 Does this PR introduce any user-facing change? No How was this patch tested? Unit test is added Closes #28347 from TJX2014/master-catalogString-distinguish-Vectors-in-ml-and-mllib. 
Authored-by: TJX2014 Signed-off-by: Sean Owen --- .../org/apache/spark/ml/util/SchemaUtils.scala | 4 ++-- .../apache/spark/mllib/util/TestingUtilsSuite.scala | 21 - 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala b/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala index 752069d..c08d7e8 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala @@ -42,8 +42,8 @@ private[spark] object SchemaUtils { val actualDataType = schema(colName).dataType val message = if (msg != null && msg.trim.length > 0) " " + msg else "" require(actualDataType.equals(dataType), - s"Column $colName must be of type ${dataType.catalogString} but was actually " + -s"${actualDataType.catalogString}.$message") + s"Column $colName must be of type ${dataType.getClass}:${dataType.catalogString} " + +s"but was actually ${actualDataType.getClass}:${actualDataType.catalogString}.$message") } /** diff --git a/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala index 3fcf1cf..bc80e86 100644 --- a/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtilsSuite.scala @@ -20,9 +20,11 @@ package org.apache.spark.mllib.util import org.scalatest.exceptions.TestFailedException import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.VectorUDT +import org.apache.spark.ml.util.SchemaUtils import org.apache.spark.mllib.linalg.{Matrices, Vectors} import org.apache.spark.mllib.util.TestingUtils._ - +import org.apache.spark.sql.types.{StructField, StructType} class TestingUtilsSuite extends SparkFunSuite { test("Comparing doubles using relative error.") { @@ -457,4 +459,21 @@ class TestingUtilsSuite extends SparkFunSuite { assert(Matrices.sparse(2, 
2, Array(0, 1, 2), Array(0, 1), Array(3.1, 3.5)) !~= Matrices.dense(0, 0, Array()) relTol 0.01) } + + test("SPARK-31400, catalogString distinguish Vectors in ml and mllib") { +val schema = StructType(Array[StructField] { + StructField("features", new org.apache.spark.mllib.linalg.VectorUDT) +}) +val e = intercept[IllegalArgumentException] { + SchemaUtils.checkColumnType(schema, "features", new VectorUDT) +} +assert(e.getMessage.contains( + "org.apache.spark.mllib.linalg.VectorUDT:struct"), + "dataType is not desired") + +val normalSchema = StructType(Array[StructField] { + StructField("features", new VectorUDT) +}) +SchemaUtils.checkColumnType(normalSchema, "features", new VectorUDT) + } }
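The fix above changes the `require()` message in `SchemaUtils.checkColumnType` to pair each `catalogString` with its class. As a toy illustration only (plain Java with hypothetical stand-in types, not Spark's `SchemaUtils` or the real `VectorUDT` classes), this sketch shows why adding the class name disambiguates two types whose catalog strings are identical:

```java
// Toy sketch: MlVectorType and MllibVectorType are hypothetical stand-ins for
// org.apache.spark.ml.linalg.VectorUDT and org.apache.spark.mllib.linalg.VectorUDT,
// which render the same catalog string and so were indistinguishable in the old message.
public class CheckColumnTypeSketch {

    interface DataType { String catalogString(); }

    static class MlVectorType implements DataType {
        public String catalogString() { return "struct<type:tinyint,size:int,indices:array,values:array>"; }
    }
    static class MllibVectorType implements DataType {
        public String catalogString() { return "struct<type:tinyint,size:int,indices:array,values:array>"; }
    }

    // After the fix, the message includes the class alongside the catalog string.
    static String mismatchMessage(String colName, DataType expected, DataType actual) {
        return "Column " + colName + " must be of type "
            + expected.getClass().getSimpleName() + ":" + expected.catalogString()
            + " but was actually "
            + actual.getClass().getSimpleName() + ":" + actual.catalogString() + ".";
    }

    public static void main(String[] args) {
        // The class names now differ even though the catalog strings are equal.
        System.out.println(mismatchMessage("features", new MlVectorType(), new MllibVectorType()));
    }
}
```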
[spark] branch master updated (b10263b -> 0ede08b)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from b10263b [SPARK-30724][SQL] Support 'LIKE ANY' and 'LIKE ALL' operators add 0ede08b [SPARK-31007][ML] KMeans optimization based on triangle-inequality No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/ml/impl/Utils.scala | 53 - .../spark/ml/clustering/GaussianMixture.scala | 16 +- .../spark/mllib/clustering/DistanceMeasure.scala | 223 - .../org/apache/spark/mllib/clustering/KMeans.scala | 52 +++-- .../spark/mllib/clustering/KMeansModel.scala | 14 +- .../mllib/clustering/DistanceMeasureSuite.scala| 77 +++ 6 files changed, 390 insertions(+), 45 deletions(-) create mode 100644 mllib/src/test/scala/org/apache/spark/mllib/clustering/DistanceMeasureSuite.scala
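SPARK-31007's changes live in `DistanceMeasure.scala`; as a generic illustration only (not the committed code), the standard triangle-inequality bound such KMeans optimizations exploit is: if d(c_best, c_j) >= 2·d(x, c_best), then d(x, c_j) >= d(c_best, c_j) - d(x, c_best) >= d(x, c_best), so center c_j cannot beat the current best and its distance to x never needs to be computed. A minimal sketch:

```java
// Sketch of triangle-inequality pruning in a nearest-center search. Pairwise
// center-to-center distances are precomputed once per iteration; each point then
// skips any center that the bound proves cannot be its nearest.
public class TriangleIneqSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    /** Returns the index of the nearest center; skipped[0] counts pruned evaluations. */
    static int nearestCenter(double[] x, double[][] centers, double[][] centerDist, int[] skipped) {
        int best = 0;
        double bestDist = dist(x, centers[0]);
        for (int j = 1; j < centers.length; j++) {
            // If centers best and j are far apart relative to bestDist, j cannot win.
            if (centerDist[best][j] >= 2 * bestDist) { skipped[0]++; continue; }
            double d = dist(x, centers[j]);
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = { {0, 0}, {10, 0}, {0, 10} };
        double[][] centerDist = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) centerDist[i][j] = dist(centers[i], centers[j]);

        int[] skipped = {0};
        int nearest = nearestCenter(new double[] {1, 1}, centers, centerDist, skipped);
        System.out.println("nearest=" + nearest + " skipped=" + skipped[0]); // nearest=0 skipped=2
    }
}
```

Both far centers are pruned here without ever computing their distance to the point, which is where the speedup comes from when there are many centers.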
[spark] branch branch-3.0 updated: [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-3.0 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.0 by this push: new 6bc6b0d [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment 6bc6b0d is described below commit 6bc6b0d4400f2ba0338770662ebafad8a0de41ac Author: Cong Du AuthorDate: Wed Apr 22 09:44:43 2020 -0500 [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment ### What changes were proposed in this pull request? This PR fixes a typo in the deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala file. ### Why are the changes needed? To deliver a correct explanation of how the placement policy works. ### Does this PR introduce any user-facing change? No ### How was this patch tested? UT as specified, although it shouldn't influence any functionality since it's in a comment. Closes #28267 from asclepiusaka/master. Authored-by: Cong Du Signed-off-by: Sean Owen (cherry picked from commit 54b97b2e143774a7238fc5a5f63e0d6eec138c41) Signed-off-by: Sean Owen --- .../yarn/LocalityPreferredContainerPlacementStrategy.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala index 2288bb5..3e33382 100644 --- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala +++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala @@ -40,7 +40,7 @@ private[yarn] case class ContainerLocalityPreferences(nodes: Array[String], rack * and cpus per task is 1, so the required container number is 15, * and host ratio is (host1: 30, host2: 30, host3: 20, 
host4: 10). * - * 1. If requested container number (18) is more than the required container number (15): + * 1. If the requested container number (18) is more than the required container number (15): * * requests for 5 containers with nodes: (host1, host2, host3, host4) * requests for 5 containers with nodes: (host1, host2, host3) @@ -63,16 +63,16 @@ private[yarn] case class ContainerLocalityPreferences(nodes: Array[String], rack * follow the method of 1 and 2. * * 4. If containers exist and some of them can match the requested localities. - * For example if we have 1 containers on each node (host1: 1, host2: 1: host3: 1, host4: 1), + * For example if we have 1 container on each node (host1: 1, host2: 1: host3: 1, host4: 1), * and the expected containers on each node would be (host1: 5, host2: 5, host3: 4, host4: 2), * so the newly requested containers on each node would be updated to (host1: 4, host2: 4, * host3: 3, host4: 1), 12 containers by total. * * 4.1 If requested container number (18) is more than newly required containers (12). Follow - * method 1 with updated ratio 4 : 4 : 3 : 1. + * method 1 with an updated ratio 4 : 4 : 3 : 1. * - * 4.2 If request container number (10) is more than newly required containers (12). Follow - * method 2 with updated ratio 4 : 4 : 3 : 1. + * 4.2 If request container number (10) is less than newly required containers (12). Follow + * method 2 with an updated ratio 4 : 4 : 3 : 1. * * 5. If containers exist and existing localities can fully cover the requested localities. * For example if we have 5 containers on each node (host1: 5, host2: 5, host3: 5, host4: 5),
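The class comment's batches ("5 containers with nodes (host1..host4), 5 with (host1..host3), ...") can be read as a layered decomposition of per-host expected counts. The sketch below is an illustration under stated assumptions, not Spark's actual implementation: it assumes hypothetical per-host counts (host1: 15, host2: 15, host3: 10, host4: 5), which are consistent with the 30:30:20:10 host ratio and 15 required containers, and peels off the smallest remaining count as one batch at a time.

```java
import java.util.*;

// Hedged sketch of the layered decomposition the comment describes: each batch of
// container requests shares one node-preference list; a host stays in the list
// until the batches emitted so far cover its expected count.
public class PlacementSketch {

    /** Each entry: batch size -> node list shared by every request in that batch. */
    static List<Map.Entry<Integer, List<String>>> toBatches(LinkedHashMap<String, Integer> expected) {
        List<Map.Entry<Integer, List<String>>> batches = new ArrayList<>();
        int emitted = 0; // containers requested by earlier batches (each mentions every live host)
        while (true) {
            List<String> nodes = new ArrayList<>();
            int layer = Integer.MAX_VALUE;
            for (Map.Entry<String, Integer> e : expected.entrySet()) {
                int remaining = e.getValue() - emitted; // still owed to this host
                if (remaining > 0) { nodes.add(e.getKey()); layer = Math.min(layer, remaining); }
            }
            if (nodes.isEmpty()) return batches;
            batches.add(new AbstractMap.SimpleEntry<>(layer, nodes));
            emitted += layer;
        }
    }

    public static void main(String[] args) {
        // Assumed counts for the 30:30:20:10 ratio with 15 required containers.
        LinkedHashMap<String, Integer> expected = new LinkedHashMap<>();
        expected.put("host1", 15); expected.put("host2", 15);
        expected.put("host3", 10); expected.put("host4", 5);
        for (Map.Entry<Integer, List<String>> b : toBatches(expected))
            System.out.println(b.getKey() + " containers with nodes: " + b.getValue());
    }
}
```

Under these assumed counts the output reproduces the comment's structure: 5 requests naming all four hosts, 5 naming (host1, host2, host3), and 5 naming (host1, host2).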
[spark] branch master updated (8b77b31 -> 54b97b2)
This is an automated email from the ASF dual-hosted git repository. srowen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git. from 8b77b31 [SPARK-18886][CORE][FOLLOWUP] allow follow up locality resets even if no task was launched add 54b97b2 [MINOR][DOCS] Fix a typo in ContainerPlacementStrategy's class comment No new revisions were added by this update. Summary of changes: .../yarn/LocalityPreferredContainerPlacementStrategy.scala | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-)
[spark] branch branch-2.4 updated: Apply appropriate RPC handler to receive, receiveStream when auth enabled
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch branch-2.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-2.4 by this push: new 9416b7c Apply appropriate RPC handler to receive, receiveStream when auth enabled 9416b7c is described below commit 9416b7c54bdf5613c1a65e6d1779a87591c6c9bd Author: Sean Owen AuthorDate: Fri Apr 17 13:25:12 2020 -0500 Apply appropriate RPC handler to receive, receiveStream when auth enabled --- .../spark/network/crypto/AuthRpcHandler.java | 73 +++--- .../apache/spark/network/sasl/SaslRpcHandler.java | 60 +++- .../network/server/AbstractAuthRpcHandler.java | 107 + .../spark/network/crypto/AuthIntegrationSuite.java | 12 +-- .../apache/spark/network/sasl/SparkSaslSuite.java | 3 +- 5 files changed, 142 insertions(+), 113 deletions(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java index 821cc7a..dd31c95 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java +++ b/common/network-common/src/main/java/org/apache/spark/network/crypto/AuthRpcHandler.java @@ -29,12 +29,11 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.spark.network.client.RpcResponseCallback; -import org.apache.spark.network.client.StreamCallbackWithID; import org.apache.spark.network.client.TransportClient; import org.apache.spark.network.sasl.SecretKeyHolder; import org.apache.spark.network.sasl.SaslRpcHandler; +import org.apache.spark.network.server.AbstractAuthRpcHandler; import org.apache.spark.network.server.RpcHandler; -import org.apache.spark.network.server.StreamManager; import org.apache.spark.network.util.TransportConf; /** @@ -46,7 +45,7 @@ import org.apache.spark.network.util.TransportConf; * The delegate will only receive messages if 
the given connection has been successfully * authenticated. A connection may be authenticated at most once. */ -class AuthRpcHandler extends RpcHandler { +class AuthRpcHandler extends AbstractAuthRpcHandler { private static final Logger LOG = LoggerFactory.getLogger(AuthRpcHandler.class); /** Transport configuration. */ @@ -55,36 +54,31 @@ class AuthRpcHandler extends RpcHandler { /** The client channel. */ private final Channel channel; - /** - * RpcHandler we will delegate to for authenticated connections. When falling back to SASL - * this will be replaced with the SASL RPC handler. - */ - @VisibleForTesting - RpcHandler delegate; - /** Class which provides secret keys which are shared by server and client on a per-app basis. */ private final SecretKeyHolder secretKeyHolder; - /** Whether auth is done and future calls should be delegated. */ + /** RPC handler for auth handshake when falling back to SASL auth. */ @VisibleForTesting - boolean doDelegate; + SaslRpcHandler saslHandler; AuthRpcHandler( TransportConf conf, Channel channel, RpcHandler delegate, SecretKeyHolder secretKeyHolder) { +super(delegate); this.conf = conf; this.channel = channel; -this.delegate = delegate; this.secretKeyHolder = secretKeyHolder; } @Override - public void receive(TransportClient client, ByteBuffer message, RpcResponseCallback callback) { -if (doDelegate) { - delegate.receive(client, message, callback); - return; + protected boolean doAuthChallenge( + TransportClient client, + ByteBuffer message, + RpcResponseCallback callback) { +if (saslHandler != null) { + return saslHandler.doAuthChallenge(client, message, callback); } int position = message.position(); @@ -98,18 +92,17 @@ class AuthRpcHandler extends RpcHandler { if (conf.saslFallback()) { LOG.warn("Failed to parse new auth challenge, reverting to SASL for client {}.", channel.remoteAddress()); -delegate = new SaslRpcHandler(conf, channel, delegate, secretKeyHolder); +saslHandler = new SaslRpcHandler(conf, channel, null, 
secretKeyHolder); message.position(position); message.limit(limit); -delegate.receive(client, message, callback); -doDelegate = true; +return saslHandler.doAuthChallenge(client, message, callback); } else { LOG.debug("Unexpected challenge message from client {}, closing channel.", channel.remoteAddress()); callback.onFailure(new IllegalArgumentException("Unknown challenge message.")); channel.close(); } - return; + return false; } // Here we have the client challenge, so perform the new auth protocol and set up the channel. @@ -131,7
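The diff above moves the "delegate only after authentication" state out of `AuthRpcHandler` into the new `AbstractAuthRpcHandler`, so the AES and SASL variants share the same gating for `receive` and `receiveStream`. A minimal sketch of that pattern, using assumed toy types (String messages instead of the real Netty/TransportClient types; not Spark's actual network-common API):

```java
// Sketch of the gating pattern: the abstract base owns the authenticated flag and
// forwards to the application delegate only once doAuthChallenge has succeeded.
public class AuthGateSketch {

    interface RpcHandler { String receive(String message); }

    static abstract class AbstractAuthRpcHandler implements RpcHandler {
        private final RpcHandler delegate;
        private boolean authenticated;

        AbstractAuthRpcHandler(RpcHandler delegate) { this.delegate = delegate; }

        /** One auth handshake step; returning true marks the channel authenticated. */
        protected abstract boolean doAuthChallenge(String message);

        @Override public String receive(String message) {
            if (authenticated) return delegate.receive(message); // delegate only after auth
            authenticated = doAuthChallenge(message);
            return authenticated ? "auth-ok" : "auth-pending";
        }
    }

    /** Toy challenge: the shared secret itself is the handshake message. */
    static class TokenAuthHandler extends AbstractAuthRpcHandler {
        private final String secret;
        TokenAuthHandler(RpcHandler delegate, String secret) { super(delegate); this.secret = secret; }
        @Override protected boolean doAuthChallenge(String message) { return secret.equals(message); }
    }

    public static void main(String[] args) {
        RpcHandler app = msg -> "handled:" + msg;
        TokenAuthHandler handler = new TokenAuthHandler(app, "s3cret");
        System.out.println(handler.receive("ping"));    // not authenticated: never reaches app
        System.out.println(handler.receive("s3cret"));  // handshake succeeds
        System.out.println(handler.receive("ping"));    // now delegated to app
    }
}
```

Centralizing the flag in the base class is what lets the real fix apply the same check to both `receive` overloads and `receiveStream` instead of each concrete handler re-implementing it.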