[spark] branch master updated (74a6b9d -> 95a5c17)

2021-07-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 74a6b9d  [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default 
value as 'sequence' in default index in pandas on Spark
 add 95a5c17  Revert "[SPARK-36345][INFRA] Update PySpark GitHubAction 
docker image to 20210730"

No new revisions were added by this update.

Summary of changes:
 .github/workflows/build_and_test.yml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)




[spark] branch branch-3.2 updated: [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 187aa6a  [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default 
value as 'sequence' in default index in pandas on Spark
187aa6a is described below

commit 187aa6ab7f3766ce8ebe6dd22c19b20e9bf043c3
Author: Hyukjin Kwon 
AuthorDate: Sat Jul 31 08:31:10 2021 +0900

[SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 
'sequence' in default index in pandas on Spark

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/33570, which 
mistakenly changed the default value of the default index.

### Why are the changes needed?

It was mistakenly changed: the default was switched temporarily to check whether 
the tests actually pass, but I forgot to change it back.

### Does this PR introduce _any_ user-facing change?

No, pandas API on Spark has not been released yet. This only fixes the 
mistakenly changed default value.
(The changed default made the test flaky because the extra shuffle affects the 
row order.)

### How was this patch tested?

Manually tested.
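
(For illustration only, not part of this commit: a minimal pandas-on-Spark 
sketch showing how the restored default can be inspected and overridden.)

```
import pyspark.pandas as ps

# The restored default after this fix.
print(ps.get_option("compute.default_index_type"))  # 'sequence'

# Opt into the distributed-sequence index explicitly when desired,
# then restore the default.
ps.set_option("compute.default_index_type", "distributed-sequence")
ps.reset_option("compute.default_index_type")
```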

Closes #33596 from HyukjinKwon/SPARK-36338-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 74a6b9d23b0280e07561714018ee7b7f31d2a2f3)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/config.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/config.py b/python/pyspark/pandas/config.py
index 10acc8c..b03f2e1 100644
--- a/python/pyspark/pandas/config.py
+++ b/python/pyspark/pandas/config.py
@@ -175,7 +175,7 @@ _options = [
 Option(
 key="compute.default_index_type",
 doc=("This sets the default index type: sequence, distributed and 
distributed-sequence."),
-default="distributed-sequence",
+default="sequence",
 types=str,
 check_func=(
 lambda v: v in ("sequence", "distributed", "distributed-sequence"),




[spark] branch master updated (0e65ed5 -> 74a6b9d)

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 0e65ed5  [SPARK-36345][INFRA] Update PySpark GitHubAction docker image 
to 20210730
 add 74a6b9d  [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default 
value as 'sequence' in default index in pandas on Spark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/config.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)




[spark] branch branch-3.2 updated: [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 586f898  [SPARK-36345][INFRA] Update PySpark GitHubAction docker image 
to 20210730
586f898 is described below

commit 586f89886fc8029d35aaa11021a4a88909c85804
Author: Dongjoon Hyun 
AuthorDate: Sat Jul 31 07:20:17 2021 +0900

[SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730

### What changes were proposed in this pull request?

This PR aims to upgrade the PySpark GitHub Action job to use the latest docker 
image `20210730`, which additionally includes `sklearn` and `mlflow`.
- 
https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/commit/5ca94453d1108dfe40bceb8872387a1b19b0c783

```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 
python3.9 -m pip list | grep mlflow
mlflow1.19.0

$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 
python3.9 -m pip list | grep sklearn
sklearn   0.0
```

### Why are the changes needed?

This will save installation time.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action PySpark jobs.

Closes #33595 from dongjoon-hyun/SPARK-36345.

Authored-by: Dongjoon Hyun 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 0e65ed5fb9c62671789a651a993abbb9f546367c)
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 3eb12f5..d247e6b 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -149,7 +149,7 @@ jobs:
 name: "Build modules: ${{ matrix.modules }}"
 runs-on: ubuntu-20.04
 container:
-  image: dongjoon/apache-spark-github-action-image:20210602
+  image: dongjoon/apache-spark-github-action-image:20210730
 strategy:
   fail-fast: false
   matrix:
@@ -227,8 +227,6 @@ jobs:
 # Run the tests.
 - name: Run tests
   run: |
-# TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of 
the base image
-python3.9 -m pip install 'mlflow>=1.0' sklearn
 export PATH=$PATH:$HOME/miniconda/bin
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload test results to report




[spark] branch master updated: [SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0e65ed5  [SPARK-36345][INFRA] Update PySpark GitHubAction docker image 
to 20210730
0e65ed5 is described below

commit 0e65ed5fb9c62671789a651a993abbb9f546367c
Author: Dongjoon Hyun 
AuthorDate: Sat Jul 31 07:20:17 2021 +0900

[SPARK-36345][INFRA] Update PySpark GitHubAction docker image to 20210730

### What changes were proposed in this pull request?

This PR aims to upgrade the PySpark GitHub Action job to use the latest docker 
image `20210730`, which additionally includes `sklearn` and `mlflow`.
- 
https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage/commit/5ca94453d1108dfe40bceb8872387a1b19b0c783

```
$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 
python3.9 -m pip list | grep mlflow
mlflow1.19.0

$ docker run -it --rm dongjoon/apache-spark-github-action-image:20210730 
python3.9 -m pip list | grep sklearn
sklearn   0.0
```

### Why are the changes needed?

This will save installation time.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action PySpark jobs.

Closes #33595 from dongjoon-hyun/SPARK-36345.

Authored-by: Dongjoon Hyun 
Signed-off-by: Hyukjin Kwon 
---
 .github/workflows/build_and_test.yml | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 17908ff..58487a4 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -186,7 +186,7 @@ jobs:
 name: "Build modules: ${{ matrix.modules }}"
 runs-on: ubuntu-20.04
 container:
-  image: dongjoon/apache-spark-github-action-image:20210602
+  image: dongjoon/apache-spark-github-action-image:20210730
 strategy:
   fail-fast: false
   matrix:
@@ -252,8 +252,6 @@ jobs:
 # Run the tests.
 - name: Run tests
   run: |
-# TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of 
the base image
-python3.9 -m pip install 'mlflow>=1.0' sklearn
 export PATH=$PATH:$HOME/miniconda/bin
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload test results to report




[spark] branch branch-3.2 updated: [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec

2021-07-30 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new f9f5656  [SPARK-35881][SQL] Add support for columnar execution of 
final query stage in AdaptiveSparkPlanExec
f9f5656 is described below

commit f9f5656491c1edbfbc9f8c13840aaa935c49037f
Author: Andy Grove 
AuthorDate: Fri Jul 30 13:21:50 2021 -0500

[SPARK-35881][SQL] Add support for columnar execution of final query stage 
in AdaptiveSparkPlanExec

### What changes were proposed in this pull request?

Changes in this PR:

- `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and 
`doExecuteColumnar` to support adaptive queries where the final query stage 
produces columnar data.
- `SessionState` now has a new set of injectable rules named 
`finalQueryStagePrepRules` that can be applied to the final query stage.
- `AdaptiveSparkPlanExec` can now safely be wrapped by either 
`RowToColumnarExec` or `ColumnarToRowExec`.

A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` 
transition that is inserted by previous rules and at execution time can call 
`finalPlanSupportsColumnar` to see if the final query stage is columnar. If the 
plan is columnar then the plugin can safely call `doExecuteColumnar`. The 
adaptive plan can be wrapped in either `RowToColumnarExec` or 
`ColumnarToRowExec` to force a particular output format. There are fast paths 
in both of these operators to avoid any re [...]

### Why are the changes needed?

Without this change it is necessary to use reflection to get the final 
physical plan, determine whether it is columnar, and execute it as a 
columnar plan. `AdaptiveSparkPlanExec` only provides public methods for 
row-based execution.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I have manually tested this patch with the RAPIDS Accelerator for Apache 
Spark.

Closes #33140 from andygrove/support-columnar-adaptive.

Authored-by: Andy Grove 
Signed-off-by: Thomas Graves 
(cherry picked from commit 0f538402fb76e4d6182cc881219d53b5fdf73af1)
Signed-off-by: Thomas Graves 
---
 .../apache/spark/sql/SparkSessionExtensions.scala  |  22 +++-
 .../org/apache/spark/sql/execution/Columnar.scala  | 134 -
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  46 +--
 .../sql/internal/BaseSessionStateBuilder.scala |   7 +-
 .../apache/spark/sql/internal/SessionState.scala   |   3 +-
 5 files changed, 141 insertions(+), 71 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
index b14dce6..18ebae5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
@@ -47,6 +47,7 @@ import org.apache.spark.sql.execution.{ColumnarRule, 
SparkPlan}
  * (External) Catalog listeners.
  * Columnar Rules.
  * Adaptive Query Stage Preparation Rules.
+ * Adaptive Query Post Stage Preparation Rules.
  * 
  *
  * The extensions can be used by calling `withExtensions` on the 
[[SparkSession.Builder]], for
@@ -110,9 +111,12 @@ class SparkSessionExtensions {
   type TableFunctionDescription = (FunctionIdentifier, ExpressionInfo, 
TableFunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
   type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]
+  type PostStageCreationRuleBuilder = SparkSession => Rule[SparkPlan]
 
   private[this] val columnarRuleBuilders = 
mutable.Buffer.empty[ColumnarRuleBuilder]
   private[this] val queryStagePrepRuleBuilders = 
mutable.Buffer.empty[QueryStagePrepRuleBuilder]
+  private[this] val postStageCreationRuleBuilders =
+mutable.Buffer.empty[PostStageCreationRuleBuilder]
 
   /**
* Build the override rules for columnar execution.
@@ -129,6 +133,14 @@ class SparkSessionExtensions {
   }
 
   /**
+   * Build the override rules for the final query stage preparation phase of 
adaptive query
+   * execution.
+   */
+  private[sql] def buildPostStageCreationRules(session: SparkSession): 
Seq[Rule[SparkPlan]] = {
+postStageCreationRuleBuilders.map(_.apply(session)).toSeq
+  }
+
+  /**
* Inject a rule that can override the columnar execution of an executor.
*/
   def injectColumnar(builder: ColumnarRuleBuilder): Unit = {
@@ -136,13 +148,21 @@ class SparkSessionExtensions {
   }
 
   /**
-   * Inject a rule that can override the the query stage preparation phase of 
adaptive query
+   * Inject a rule that can override the query stage preparation phase of 
adaptive query
* execution.
*/
   def injectQueryStagePrepRule(builder: 

[spark] branch master updated: [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec

2021-07-30 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0f53840  [SPARK-35881][SQL] Add support for columnar execution of 
final query stage in AdaptiveSparkPlanExec
0f53840 is described below

commit 0f538402fb76e4d6182cc881219d53b5fdf73af1
Author: Andy Grove 
AuthorDate: Fri Jul 30 13:21:50 2021 -0500

[SPARK-35881][SQL] Add support for columnar execution of final query stage 
in AdaptiveSparkPlanExec

### What changes were proposed in this pull request?

Changes in this PR:

- `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and 
`doExecuteColumnar` to support adaptive queries where the final query stage 
produces columnar data.
- `SessionState` now has a new set of injectable rules named 
`finalQueryStagePrepRules` that can be applied to the final query stage.
- `AdaptiveSparkPlanExec` can now safely be wrapped by either 
`RowToColumnarExec` or `ColumnarToRowExec`.

A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` 
transition that is inserted by previous rules and at execution time can call 
`finalPlanSupportsColumnar` to see if the final query stage is columnar. If the 
plan is columnar then the plugin can safely call `doExecuteColumnar`. The 
adaptive plan can be wrapped in either `RowToColumnarExec` or 
`ColumnarToRowExec` to force a particular output format. There are fast paths 
in both of these operators to avoid any re [...]

### Why are the changes needed?

Without this change it is necessary to use reflection to get the final 
physical plan, determine whether it is columnar, and execute it as a 
columnar plan. `AdaptiveSparkPlanExec` only provides public methods for 
row-based execution.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I have manually tested this patch with the RAPIDS Accelerator for Apache 
Spark.

Closes #33140 from andygrove/support-columnar-adaptive.

Authored-by: Andy Grove 
Signed-off-by: Thomas Graves 
---
 .../apache/spark/sql/SparkSessionExtensions.scala  |  22 +++-
 .../org/apache/spark/sql/execution/Columnar.scala  | 134 -
 .../execution/adaptive/AdaptiveSparkPlanExec.scala |  46 +--
 .../sql/internal/BaseSessionStateBuilder.scala |   7 +-
 .../apache/spark/sql/internal/SessionState.scala   |   3 +-
 5 files changed, 141 insertions(+), 71 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
index b14dce6..18ebae5 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SparkSessionExtensions.scala
@@ -47,6 +47,7 @@ import org.apache.spark.sql.execution.{ColumnarRule, 
SparkPlan}
  * (External) Catalog listeners.
  * Columnar Rules.
  * Adaptive Query Stage Preparation Rules.
+ * Adaptive Query Post Stage Preparation Rules.
  * 
  *
  * The extensions can be used by calling `withExtensions` on the 
[[SparkSession.Builder]], for
@@ -110,9 +111,12 @@ class SparkSessionExtensions {
   type TableFunctionDescription = (FunctionIdentifier, ExpressionInfo, 
TableFunctionBuilder)
   type ColumnarRuleBuilder = SparkSession => ColumnarRule
   type QueryStagePrepRuleBuilder = SparkSession => Rule[SparkPlan]
+  type PostStageCreationRuleBuilder = SparkSession => Rule[SparkPlan]
 
   private[this] val columnarRuleBuilders = 
mutable.Buffer.empty[ColumnarRuleBuilder]
   private[this] val queryStagePrepRuleBuilders = 
mutable.Buffer.empty[QueryStagePrepRuleBuilder]
+  private[this] val postStageCreationRuleBuilders =
+mutable.Buffer.empty[PostStageCreationRuleBuilder]
 
   /**
* Build the override rules for columnar execution.
@@ -129,6 +133,14 @@ class SparkSessionExtensions {
   }
 
   /**
+   * Build the override rules for the final query stage preparation phase of 
adaptive query
+   * execution.
+   */
+  private[sql] def buildPostStageCreationRules(session: SparkSession): 
Seq[Rule[SparkPlan]] = {
+postStageCreationRuleBuilders.map(_.apply(session)).toSeq
+  }
+
+  /**
* Inject a rule that can override the columnar execution of an executor.
*/
   def injectColumnar(builder: ColumnarRuleBuilder): Unit = {
@@ -136,13 +148,21 @@ class SparkSessionExtensions {
   }
 
   /**
-   * Inject a rule that can override the the query stage preparation phase of 
adaptive query
+   * Inject a rule that can override the query stage preparation phase of 
adaptive query
* execution.
*/
   def injectQueryStagePrepRule(builder: QueryStagePrepRuleBuilder): Unit = {
 queryStagePrepRuleBuilders += builder
   }
 
+  /**
+   * Inject a rule that 

[spark] branch branch-3.2 updated: [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps

2021-07-30 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new a4dcda1  [SPARK-36350][PYTHON] Move some logic related to F.nanvl to 
DataTypeOps
a4dcda1 is described below

commit a4dcda179448cc7632f5022e26be3cafb3d82505
Author: Takuya UESHIN 
AuthorDate: Fri Jul 30 11:19:49 2021 -0700

[SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps

### What changes were proposed in this pull request?

Move some logic related to `F.nanvl` to `DataTypeOps`.

### Why are the changes needed?

There are several places that branch on `FloatType` or `DoubleType` in order to 
use `F.nanvl`, but `DataTypeOps` should handle this.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.
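
(For illustration only, not part of this patch: a minimal sketch of the 
NaN-vs-null counting behavior that `nan_to_null` normalizes, assuming a working 
PySpark installation.)

```
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series([1.0, float("nan"), 3.0])
print(pser.count())   # 2 -- pandas does not count NaN

# pandas-on-Spark maps NaN to null via F.nanvl before counting, so the result
# matches pandas even though Spark's count treats NaN as a valid value.
psser = ps.from_pandas(pser)
print(psser.count())  # 2
```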

Closes #33582 from ueshin/issues/SPARK-36350/nan_to_null.

Authored-by: Takuya UESHIN 
Signed-off-by: Takuya UESHIN 
(cherry picked from commit 895e3f5e2aff46f5c4eed8ac9ddf4dbfb16ef5fd)
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/data_type_ops/base.py|  3 ++
 python/pyspark/pandas/data_type_ops/num_ops.py | 11 +
 python/pyspark/pandas/frame.py | 50 +++
 python/pyspark/pandas/generic.py   | 66 --
 python/pyspark/pandas/groupby.py   | 66 +-
 python/pyspark/pandas/series.py| 22 +++--
 6 files changed, 112 insertions(+), 106 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/base.py 
b/python/pyspark/pandas/data_type_ops/base.py
index 7eb2a95..743b2c5 100644
--- a/python/pyspark/pandas/data_type_ops/base.py
+++ b/python/pyspark/pandas/data_type_ops/base.py
@@ -366,5 +366,8 @@ class DataTypeOps(object, metaclass=ABCMeta):
 ),
 )
 
+def nan_to_null(self, index_ops: IndexOpsLike) -> IndexOpsLike:
+return index_ops.copy()
+
 def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) 
-> IndexOpsLike:
 raise TypeError("astype can not be applied to %s." % self.pretty_name)
diff --git a/python/pyspark/pandas/data_type_ops/num_ops.py 
b/python/pyspark/pandas/data_type_ops/num_ops.py
index a7987bc..f84c1af 100644
--- a/python/pyspark/pandas/data_type_ops/num_ops.py
+++ b/python/pyspark/pandas/data_type_ops/num_ops.py
@@ -326,6 +326,14 @@ class FractionalOps(NumericOps):
 ),
 )
 
+def nan_to_null(self, index_ops: IndexOpsLike) -> IndexOpsLike:
+# Special handle floating point types because Spark's count treats nan 
as a valid value,
+# whereas pandas count doesn't include nan.
+return index_ops._with_new_scol(
+F.nanvl(index_ops.spark.column, SF.lit(None)),
+field=index_ops._internal.data_fields[0].copy(nullable=True),
+)
+
 def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) 
-> IndexOpsLike:
 dtype, spark_type = pandas_on_spark_type(dtype)
 
@@ -385,6 +393,9 @@ class DecimalOps(FractionalOps):
 ),
 )
 
+def nan_to_null(self, index_ops: IndexOpsLike) -> IndexOpsLike:
+return index_ops.copy()
+
 def astype(self, index_ops: IndexOpsLike, dtype: Union[str, type, Dtype]) 
-> IndexOpsLike:
 # TODO(SPARK-36230): check index_ops.hasnans after fixing SPARK-36230
 dtype, spark_type = pandas_on_spark_type(dtype)
diff --git a/python/pyspark/pandas/frame.py b/python/pyspark/pandas/frame.py
index de675f1f..4a737b6 100644
--- a/python/pyspark/pandas/frame.py
+++ b/python/pyspark/pandas/frame.py
@@ -642,7 +642,7 @@ class DataFrame(Frame, Generic[T]):
 
 def _reduce_for_stat_function(
 self,
-sfun: Union[Callable[[Column], Column], Callable[[Column, DataType], 
Column]],
+sfun: Callable[["Series"], Column],
 name: str,
 axis: Optional[Axis] = None,
 numeric_only: bool = True,
@@ -664,7 +664,6 @@ class DataFrame(Frame, Generic[T]):
 is mainly for pandas compatibility. Only 'DataFrame.count' uses 
this parameter
 currently.
 """
-from inspect import signature
 from pyspark.pandas.series import Series, first_series
 
 axis = validate_axis(axis)
@@ -673,29 +672,19 @@ class DataFrame(Frame, Generic[T]):
 
 exprs = 
[SF.lit(None).cast(StringType()).alias(SPARK_DEFAULT_INDEX_NAME)]
 new_column_labels = []
-num_args = len(signature(sfun).parameters)
 for label in self._internal.column_labels:
-spark_column = self._internal.spark_column_for(label)
-spark_type = self._internal.spark_type_for(label)
+psser = self._psser_for(label)
 
-

[spark] branch master updated (801017f -> 895e3f5)

2021-07-30 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 801017f  [SPARK-36358][K8S] Upgrade Kubernetes Client Version to 5.6.0
 add 895e3f5  [SPARK-36350][PYTHON] Move some logic related to F.nanvl to 
DataTypeOps

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/data_type_ops/base.py|  3 ++
 python/pyspark/pandas/data_type_ops/num_ops.py | 11 +
 python/pyspark/pandas/frame.py | 50 +++
 python/pyspark/pandas/generic.py   | 66 --
 python/pyspark/pandas/groupby.py   | 66 +-
 python/pyspark/pandas/series.py| 22 +++--
 6 files changed, 112 insertions(+), 106 deletions(-)




[spark] branch master updated (c6140d4 -> 801017f)

2021-07-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from c6140d4  [SPARK-36338][PYTHON][SQL] Move distributed-sequence 
implementation to Scala side
 add 801017f  [SPARK-36358][K8S] Upgrade Kubernetes Client Version to 5.6.0

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 42 -
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 42 -
 pom.xml |  2 +-
 3 files changed, 43 insertions(+), 43 deletions(-)




[spark] branch branch-3.2 updated: [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new fee87f1  [SPARK-36338][PYTHON][SQL] Move distributed-sequence 
implementation to Scala side
fee87f1 is described below

commit fee87f13d1be2fb018ff200d4f6dbfdb30f03a99
Author: Hyukjin Kwon 
AuthorDate: Fri Jul 30 22:29:23 2021 +0900

[SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to 
Scala side

### What changes were proposed in this pull request?

This PR proposes to implement the `distributed-sequence` index on the Scala side.

### Why are the changes needed?

- Avoid unnecessary (de)serialization
- Keep the nullability of the input DataFrame when `distributed-sequence` 
is enabled. During serialization, all fields currently become nullable 
(see https://github.com/apache/spark/pull/32775#discussion_r645882104).

### Does this PR introduce _any_ user-facing change?

No to end users since pandas API on Spark is not released yet.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```

Before:

```
root
 |-- id: long (nullable = true)
```

After:

```
root
 |-- id: long (nullable = false)
```

### How was this patch tested?

Manually tested, and existing tests should cover them.

Closes #33570 from HyukjinKwon/SPARK-36338.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit c6140d4d0af7f34152c1b681110a077be36f4068)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/accessors.py |  24 +---
 python/pyspark/pandas/config.py|   2 +-
 python/pyspark/pandas/indexes/base.py  |  16 +--
 python/pyspark/pandas/indexes/multi.py |   2 +-
 python/pyspark/pandas/indexing.py  |  16 +--
 python/pyspark/pandas/internal.py  | 150 +++--
 python/pyspark/pandas/series.py|  12 +-
 python/pyspark/pandas/tests/test_dataframe.py  |  37 ++---
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  25 
 .../org/apache/spark/sql/DataFrameSuite.scala  |   6 +
 10 files changed, 89 insertions(+), 201 deletions(-)

diff --git a/python/pyspark/pandas/accessors.py 
b/python/pyspark/pandas/accessors.py
index 6454938..d71ded1 100644
--- a/python/pyspark/pandas/accessors.py
+++ b/python/pyspark/pandas/accessors.py
@@ -169,7 +169,7 @@ class PandasOnSparkFrameMethods(object):
 for scol, label in zip(internal.data_spark_columns, 
internal.column_labels)
 ]
 )
-sdf, force_nullable = attach_func(sdf, name_like_string(column))
+sdf = attach_func(sdf, name_like_string(column))
 
 return DataFrame(
 InternalFrame(
@@ -178,28 +178,18 @@ class PandasOnSparkFrameMethods(object):
 scol_for(sdf, SPARK_INDEX_NAME_FORMAT(i)) for i in 
range(internal.index_level)
 ],
 index_names=internal.index_names,
-index_fields=(
-[field.copy(nullable=True) for field in 
internal.index_fields]
-if force_nullable
-else internal.index_fields
-),
+index_fields=internal.index_fields,
 column_labels=internal.column_labels + [column],
 data_spark_columns=(
 [scol_for(sdf, name_like_string(label)) for label in 
internal.column_labels]
 + [scol_for(sdf, name_like_string(column))]
 ),
-data_fields=(
-(
-[field.copy(nullable=True) for field in 
internal.data_fields]
-if force_nullable
-else internal.data_fields
+data_fields=internal.data_fields
++ [
+InternalField.from_struct_field(
+StructField(name_like_string(column), LongType(), 
nullable=False)
 )
-+ [
-InternalField.from_struct_field(
-StructField(name_like_string(column), LongType(), 
nullable=False)
-)
-]
-),
+],
 column_label_names=internal.column_label_names,
 ).resolved_copy
 )
diff --git a/python/pyspark/pandas/config.py b/python/pyspark/pandas/config.py
index b03f2e1..10acc8c 100644
--- a/python/pyspark/pandas/config.py
+++ b/python/pyspark/pandas/config.py
@@ -175,7 +175,7 @@ _options = [
 Option(
 

[spark] branch master updated (dd2ca0a -> c6140d4)

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from dd2ca0a  [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in 
pandas on Spark
 add c6140d4  [SPARK-36338][PYTHON][SQL] Move distributed-sequence 
implementation to Scala side

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/accessors.py |  24 +---
 python/pyspark/pandas/config.py|   2 +-
 python/pyspark/pandas/indexes/base.py  |  16 +--
 python/pyspark/pandas/indexes/multi.py |   2 +-
 python/pyspark/pandas/indexing.py  |  16 +--
 python/pyspark/pandas/internal.py  | 150 +++--
 python/pyspark/pandas/series.py|  12 +-
 python/pyspark/pandas/tests/test_dataframe.py  |  37 ++---
 .../main/scala/org/apache/spark/sql/Dataset.scala  |  25 
 .../org/apache/spark/sql/DataFrameSuite.scala  |   6 +
 10 files changed, 89 insertions(+), 201 deletions(-)




[spark] branch branch-3.2 updated: [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 9cd3708  [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in 
pandas on Spark
9cd3708 is described below

commit 9cd370894baafb59d76d269595f6fdbaf94c76b8
Author: Hyukjin Kwon 
AuthorDate: Fri Jul 30 22:28:19 2021 +0900

[SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on 
Spark

### What changes were proposed in this pull request?

This PR is a partial revert of https://github.com/apache/spark/pull/33567 
that keeps the logic to skip the mlflow-related tests when mlflow is not installed.

### Why are the changes needed?

It's consistent with how other optional libraries, e.g., PyArrow, are handled.
It also fixes a potential dev breakage (see also 
https://github.com/apache/spark/pull/33567#issuecomment-889841829).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

This is a partial revert. CI should test it out too.

Closes #33589 from HyukjinKwon/SPARK-36254.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit dd2ca0aee25f6da1a1cf36ea8c7e0095420b7c00)
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/mlflow.py | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 4e48369..719db40 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -229,4 +229,10 @@ def _test() -> None:
 
 
 if __name__ == "__main__":
-_test()
+try:
+import mlflow  # noqa: F401
+import sklearn  # noqa: F401
+
+_test()
+except ImportError:
+pass




[spark] branch master updated (387a251 -> dd2ca0a)

2021-07-30 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 387a251  [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
 add dd2ca0a  [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in 
pandas on Spark

No new revisions were added by this update.

Summary of changes:
 python/pyspark/pandas/mlflow.py | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)




[GitHub] [spark-website] tgravescs commented on a change in pull request #350: [SPARK-36335] Add local-cluster docs to developer-tools.md

2021-07-30 Thread GitBox


tgravescs commented on a change in pull request #350:
URL: https://github.com/apache/spark-website/pull/350#discussion_r679883614



##
File path: site/developer-tools.html
##
@@ -793,6 +793,18 @@ In Spark unit tests
 The platform-specific paths to the profiler agents are listed in the
 YourKit documentation (http://www.yourkit.com/docs/80/help/agent.jsp).
 
+
+Local-cluster mode
+
+When launching applications with spark-submit, besides the options described in
+Master URLs (https://spark.apache.org/docs/latest/submitting-applications.html#master-urls),
+set the local-cluster option to emulate a distributed cluster in a single JVM.

Review comment:
   shouldn't we explicitly say this is for unit testing only?
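
(For illustration only, not from the PR or the docs: a minimal PySpark sketch of 
the `local-cluster` master. It assumes a built Spark distribution on the machine, 
the worker/core/memory numbers are arbitrary, and, as the comment above suggests, 
this mode is mainly meant for Spark's own testing.)

```
from pyspark.sql import SparkSession

# local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB] emulates a
# distributed cluster inside a single JVM; intended for testing, not production.
spark = (
    SparkSession.builder
    .master("local-cluster[2,1,1024]")
    .appName("local-cluster-demo")
    .getOrCreate()
)
print(spark.sparkContext.defaultParallelism)
spark.stop()
```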










[spark] branch branch-3.2 updated: [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown

2021-07-30 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new f6bb75b  [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
f6bb75b is described below

commit f6bb75b0bcae8c0bccf361dfd3710ce5f17173d5
Author: Wenchen Fan 
AuthorDate: Fri Jul 30 00:26:32 2021 -0700

[SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown

### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33352, to 
simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC 
server, instead of calculating it ourselves. This simplifies the code a 
lot and is also more robust: the data type of SUM may vary across 
databases, so it's fragile to assume it is always the same as Spark's.
2. Because of 1, we can now remove the `dataType` property from the public 
`Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them 
down.
2. Improve the `toString` of public aggregate expressions to make them more 
SQL-like.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests
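
(For illustration only, not part of this patch: a sketch of a query whose 
aggregates the DataSource V2 JDBC source can consider pushing down. The catalog 
name, JDBC URL, table, the presence of an H2 driver on the classpath, and the 
`pushDownAggregate` option are assumptions based on the JDBC data source options.)

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jdbc-agg-pushdown-sketch")
    # Hypothetical V2 JDBC catalog backed by an in-memory H2 database.
    .config("spark.sql.catalog.h2",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")
    .config("spark.sql.catalog.h2.driver", "org.h2.Driver")
    .config("spark.sql.catalog.h2.pushDownAggregate", "true")
    .getOrCreate()
)

# MAX/MIN/COUNT/SUM over a simple scan are pushdown candidates; with this change
# the schema of a pushed-down SUM comes from the JDBC server instead of Spark.
spark.sql(
    "SELECT dept, SUM(salary), COUNT(*) FROM h2.test.employee GROUP BY dept"
).show()
```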

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan 
Signed-off-by: Liang-Chi Hsieh 
(cherry picked from commit 387a251a682a596ba4156b7d12e6025762ebac85)
Signed-off-by: Liang-Chi Hsieh 
---
 .../spark/sql/connector/expressions/Count.java |  8 +-
 .../spark/sql/connector/expressions/CountStar.java |  2 +-
 .../spark/sql/connector/expressions/Max.java   |  2 +-
 .../spark/sql/connector/expressions/Min.java   |  2 +-
 .../spark/sql/connector/expressions/Sum.java   | 12 +--
 .../execution/datasources/DataSourceStrategy.scala |  3 +-
 .../sql/execution/datasources/jdbc/JDBCRDD.scala   | 45 +--
 .../execution/datasources/jdbc/JDBCRelation.scala  |  7 +-
 .../execution/datasources/v2/PushDownUtils.scala   |  2 +-
 .../datasources/v2/V2ScanRelationPushDown.scala| 24 --
 .../execution/datasources/v2/jdbc/JDBCScan.scala   | 12 ++-
 .../datasources/v2/jdbc/JDBCScanBuilder.scala  | 88 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 30 
 13 files changed, 121 insertions(+), 116 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
index 0e28a93..fecde71 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Count.java
@@ -38,7 +38,13 @@ public final class Count implements AggregateFunc {
   public boolean isDistinct() { return isDistinct; }
 
   @Override
-  public String toString() { return "Count(" + column.describe() + "," + 
isDistinct + ")"; }
+  public String toString() {
+if (isDistinct) {
+  return "COUNT(DISTINCT " + column.describe() + ")";
+} else {
+  return "COUNT(" + column.describe() + ")";
+}
+  }
 
   @Override
   public String describe() { return this.toString(); }
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
index 21a3564..8e799cd 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/CountStar.java
@@ -31,7 +31,7 @@ public final class CountStar implements AggregateFunc {
   }
 
   @Override
-  public String toString() { return "CountStar()"; }
+  public String toString() { return "COUNT(*)"; }
 
   @Override
   public String describe() { return this.toString(); }
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
index d2ff6b2..3ce45ca 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/Max.java
@@ -33,7 +33,7 @@ public final class Max implements AggregateFunc {
   public FieldReference column() { return column; }
 
   @Override
-  public String toString() { return "Max(" + column.describe() + ")"; }
+  public String toString() { return "MAX(" + column.describe() + ")"; }
 
   @Override
   public String describe() { return this.toString(); }
diff --git 

[spark] branch master updated (abce61f -> 387a251)

2021-07-30 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from abce61f  [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions 
CI
 add 387a251  [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/connector/expressions/Count.java |  8 +-
 .../spark/sql/connector/expressions/CountStar.java |  2 +-
 .../spark/sql/connector/expressions/Max.java   |  2 +-
 .../spark/sql/connector/expressions/Min.java   |  2 +-
 .../spark/sql/connector/expressions/Sum.java   | 12 +--
 .../execution/datasources/DataSourceStrategy.scala |  3 +-
 .../sql/execution/datasources/jdbc/JDBCRDD.scala   | 45 +--
 .../execution/datasources/jdbc/JDBCRelation.scala  |  7 +-
 .../execution/datasources/v2/PushDownUtils.scala   |  2 +-
 .../datasources/v2/V2ScanRelationPushDown.scala| 24 --
 .../execution/datasources/v2/jdbc/JDBCScan.scala   | 12 ++-
 .../datasources/v2/jdbc/JDBCScanBuilder.scala  | 88 ++
 .../org/apache/spark/sql/jdbc/JDBCV2Suite.scala| 30 
 13 files changed, 121 insertions(+), 116 deletions(-)




[spark] branch branch-3.2 updated: [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI

2021-07-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new a9c5b1a5 [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions 
CI
a9c5b1a5 is described below

commit a9c5b1a5c85a584ad866badcb35067713139b0bc
Author: itholic 
AuthorDate: Fri Jul 30 00:04:48 2021 -0700

[SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI

### What changes were proposed in this pull request?

This PR proposes adding the Python packages `mlflow` and `sklearn` to enable 
the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually tested locally with `python/run-tests --testnames 
pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic 
Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit abce61f3fda73e865a80e9c38bf9ca471a6a5db8)
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 2 ++
 dev/requirements.txt | 3 ++-
 python/pyspark/pandas/mlflow.py  | 8 +---
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index cfc20ac..3eb12f5 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -227,6 +227,8 @@ jobs:
 # Run the tests.
 - name: Run tests
   run: |
+# TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of 
the base image
+python3.9 -m pip install 'mlflow>=1.0' sklearn
 export PATH=$PATH:$HOME/miniconda/bin
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload test results to report
diff --git a/dev/requirements.txt b/dev/requirements.txt
index f5d662b..34f4b88 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -7,7 +7,8 @@ pyarrow
 pandas
 scipy
 plotly
-mlflow
+mlflow>=1.0
+sklearn
 matplotlib<3.3.0
 
 # PySpark test dependencies
diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 719db40..4e48369 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -229,10 +229,4 @@ def _test() -> None:
 
 
 if __name__ == "__main__":
-try:
-import mlflow  # noqa: F401
-import sklearn  # noqa: F401
-
-_test()
-except ImportError:
-pass
+_test()




[spark] branch master updated: [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI

2021-07-30 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new abce61f  [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions 
CI
abce61f is described below

commit abce61f3fda73e865a80e9c38bf9ca471a6a5db8
Author: itholic 
AuthorDate: Fri Jul 30 00:04:48 2021 -0700

[SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI

### What changes were proposed in this pull request?

This PR proposes adding the Python packages `mlflow` and `sklearn` to enable 
the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually tested locally with `python/run-tests --testnames 
pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic 
Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun 
---
 .github/workflows/build_and_test.yml | 2 ++
 dev/requirements.txt | 3 ++-
 python/pyspark/pandas/mlflow.py  | 8 +---
 3 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index f3a6363..17908ff 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -252,6 +252,8 @@ jobs:
 # Run the tests.
 - name: Run tests
   run: |
+# TODO(SPARK-36345): Install mlflow>=1.0 and sklearn in Python 3.9 of 
the base image
+python3.9 -m pip install 'mlflow>=1.0' sklearn
 export PATH=$PATH:$HOME/miniconda/bin
 ./dev/run-tests --parallelism 1 --modules "$MODULES_TO_TEST"
 - name: Upload test results to report
diff --git a/dev/requirements.txt b/dev/requirements.txt
index f5d662b..34f4b88 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -7,7 +7,8 @@ pyarrow
 pandas
 scipy
 plotly
-mlflow
+mlflow>=1.0
+sklearn
 matplotlib<3.3.0
 
 # PySpark test dependencies
diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 719db40..4e48369 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -229,10 +229,4 @@ def _test() -> None:
 
 
 if __name__ == "__main__":
-try:
-import mlflow  # noqa: F401
-import sklearn  # noqa: F401
-
-_test()
-except ImportError:
-pass
+_test()
