This is an automated email from the ASF dual-hosted git repository.

zero323 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 6d3cfed  [SPARK-36989][TESTS][PYTHON] Add type hints data tests
6d3cfed is described below

commit 6d3cfed0e8ab2812bc44bfcd5b82d8d227e77a29
Author: zero323 <mszymkiew...@gmail.com>
AuthorDate: Tue Oct 26 10:32:34 2021 +0200

    [SPARK-36989][TESTS][PYTHON] Add type hints data tests
    
    ### What changes were proposed in this pull request?
    
    This PR:
    
    - Adds a basic data test runner to `dev/lint-python`, using [`typeddjango/pytest-mypy-plugins`](https://github.com/typeddjango/pytest-mypy-plugins)
    - Migrates data test cases from `pyspark-stubs`
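
    For illustration, each data test is a YAML case whose `main` block is type-checked with mypy and whose expected diagnostics are asserted through `# E:` comments. A minimal sketch (the case name here is made up; the real cases are in the diff below):

    ```
    - case: readLinearSVCModelSketch
      main: |
        from pyspark.ml.classification import LinearSVCModel

        model = LinearSVCModel.load("dummy")
        model.coefficients.toArray()
        model.foo()  # E: "LinearSVCModel" has no attribute "foo"  [attr-defined]
    ```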
    
    In case of failure, a message similar to the following one will be displayed:
    
    ```
    starting mypy annotations test...
    annotations passed mypy checks.
    
    starting mypy data test...
    annotations failed data checks:
    ============================= test session starts ==============================
    platform linux -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
    rootdir: /path/to/spark/python, configfile: pyproject.toml
    plugins: mypy-plugins-1.9.2
    collected 37 items

    python/pyspark/ml/tests/typing/test_classification.yml ..                [  5%]
    python/pyspark/ml/tests/typing/test_evaluation.yml .                     [  8%]
    python/pyspark/ml/tests/typing/test_feature.yml .                        [ 10%]
    python/pyspark/ml/tests/typing/test_param.yml .                          [ 13%]
    python/pyspark/ml/tests/typing/test_readable.yml .                       [ 16%]
    python/pyspark/ml/tests/typing/test_regression.yml ..                    [ 21%]
    python/pyspark/sql/tests/typing/test_column.yml F                        [ 24%]
    python/pyspark/sql/tests/typing/test_dataframe.yml .......               [ 43%]
    python/pyspark/sql/tests/typing/test_functions.yml .                     [ 45%]
    python/pyspark/sql/tests/typing/test_pandas_compatibility.yml ..         [ 51%]
    python/pyspark/sql/tests/typing/test_readwriter.yml ..                   [ 56%]
    python/pyspark/sql/tests/typing/test_session.yml .....                   [ 70%]
    python/pyspark/sql/tests/typing/test_udf.yml .......                     [ 89%]
    python/pyspark/tests/typing/test_context.yml .                           [ 91%]
    python/pyspark/tests/typing/test_core.yml .                              [ 94%]
    python/pyspark/tests/typing/test_rdd.yml .                               [ 97%]
    python/pyspark/tests/typing/test_resultiterable.yml .                    [100%]

    =================================== FAILURES ===================================
    ______________________________ colDateTimeCompare ______________________________
    /path/to/spark/python/pyspark/sql/tests/typing/test_column.yml:39:
    E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
    E   Actual:
    E     main:20: note: Revealed type is "pyspark.sql.column.Column" (diff)
    E   Expected:
    E     main:20: note: Revealed type is "datetime.date*" (diff)
    E   Alignment of first line difference:
    E     E: ...ote: Revealed type is "datetime.date*"
    E     A: ...ote: Revealed type is "pyspark.sql.column.Column"
    E                                  ^
    =========================== short test summary info ============================
    FAILED python/pyspark/sql/tests/typing/test_column.yml::colDateTimeCompare -
    ======================== 1 failed, 36 passed in 56.13s =========================
    ```
    
    ### Why are the changes needed?
    
    Currently, type annotations are tested primarily for integrity and, to a lesser extent, against the actual API. Testing against examples is a work in progress (SPARK-36997). Data tests allow us to improve coverage and to test negative cases (code that should fail type checker validation).
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Running the linter tests with the additions proposed in this PR.
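
    For reference, the checks can be reproduced locally with something like the following (assuming the linter dependencies listed above, including `pytest-mypy-plugins`, are installed):

    ```
    # Run the Python linters, including the new mypy annotation and data tests.
    # PYTHON_EXECUTABLE defaults to python3 (see dev/lint-python).
    PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
    ```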
    
    Closes #34296 from zero323/SPARK-36989.
    
    Authored-by: zero323 <mszymkiew...@gmail.com>
    Signed-off-by: zero323 <mszymkiew...@gmail.com>
---
 .github/workflows/build_and_test.yml               |   2 +
 dev/lint-python                                    |  75 +++++++--
 python/pyproject.toml                              |  25 +++
 .../ml/tests/typing/test_classification.yml        |  38 +++++
 python/pyspark/ml/tests/typing/test_evaluation.yml |  26 ++++
 python/pyspark/ml/tests/typing/test_feature.yml    |  44 ++++++
 python/pyspark/ml/tests/typing/test_param.yml      |  30 ++++
 python/pyspark/ml/tests/typing/test_readable.yml   |  28 ++++
 python/pyspark/ml/tests/typing/test_regression.yml |  38 +++++
 python/pyspark/sql/tests/typing/test_column.yml    |  37 +++++
 python/pyspark/sql/tests/typing/test_dataframe.yml | 140 +++++++++++++++++
 python/pyspark/sql/tests/typing/test_functions.yml |  90 +++++++++++
 .../sql/tests/typing/test_pandas_compatibility.yml |  35 +++++
 .../pyspark/sql/tests/typing/test_readwriter.yml   |  45 ++++++
 python/pyspark/sql/tests/typing/test_session.yml   | 113 ++++++++++++++
 python/pyspark/sql/tests/typing/test_udf.yml       | 170 +++++++++++++++++++++
 python/pyspark/tests/typing/test_context.yml       |  21 +++
 python/pyspark/tests/typing/test_core.yml          |  20 +++
 python/pyspark/tests/typing/test_rdd.yml           |  62 ++++++++
 .../pyspark/tests/typing/test_resultiterable.yml   |  22 +++
 20 files changed, 1047 insertions(+), 14 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index f586d55..3f2d500 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -442,6 +442,8 @@ jobs:
         # Jinja2 3.0.0+ causes error when building with Sphinx.
         #   See also https://issues.apache.org/jira/browse/SPARK-35375.
        python3.9 -m pip install flake8 pydata_sphinx_theme 'mypy==0.910' numpydoc 'jinja2<3.0.0' 'black==21.5b2'
+        # TODO Update to PyPI
+        python3.9 -m pip install git+https://github.com/typeddjango/pytest-mypy-plugins.git@b0020061f48e85743ee3335bd62a3a608d17c6bd
     - name: Install R linter dependencies and SparkR
       run: |
        apt-get install -y libcurl4-openssl-dev libgit2-dev libssl-dev libxml2-dev
diff --git a/dev/lint-python b/dev/lint-python
index 0639ff6..f04cd1a 100755
--- a/dev/lint-python
+++ b/dev/lint-python
@@ -22,6 +22,7 @@ MINIMUM_MYPY="0.910"
 MYPY_BUILD="mypy"
 PYCODESTYLE_BUILD="pycodestyle"
 MINIMUM_PYCODESTYLE="2.7.0"
+PYTEST_BUILD="pytest"
 
 PYTHON_EXECUTABLE="${PYTHON_EXECUTABLE:-python3}"
 
@@ -123,10 +124,66 @@ function pycodestyle_test {
     fi
 }
 
-function mypy_test {
+
+function mypy_annotation_test {
     local MYPY_REPORT=
     local MYPY_STATUS=
 
+    echo "starting mypy annotations test..."
+    MYPY_REPORT=$( ($MYPY_BUILD \
+      --config-file python/mypy.ini \
+      --cache-dir /tmp/.mypy_cache/ \
+      python/pyspark) 2>&1)
+    MYPY_STATUS=$?
+
+    if [ "$MYPY_STATUS" -ne 0 ]; then
+        echo "annotations failed mypy checks:"
+        echo "$MYPY_REPORT"
+        echo "$MYPY_STATUS"
+        exit "$MYPY_STATUS"
+    else
+        echo "annotations passed mypy checks."
+        echo
+    fi
+}
+
+
+function mypy_data_test {
+    local PYTEST_REPORT=
+    local PYTEST_STATUS=
+
+    echo "starting mypy data test..."
+
+    $PYTHON_EXECUTABLE -c "import importlib.util; import sys; \
+               sys.exit(0 if importlib.util.find_spec('pytest_mypy_plugins') else 1)"
+
+    if [ $? -ne 0 ]; then
+      echo "pytest-mypy-plugins missing. Skipping for now."
+      return
+    fi
+
+    PYTEST_REPORT=$( (MYPYPATH=python $PYTEST_BUILD \
+      -c python/pyproject.toml \
+      --rootdir python \
+      --mypy-only-local-stub \
+      --mypy-ini-file python/mypy.ini \
+      python/pyspark ) 2>&1)
+
+    PYTEST_STATUS=$?
+
+    if [ "$PYTEST_STATUS" -ne 0 ]; then
+        echo "annotations failed data checks:"
+        echo "$PYTEST_REPORT"
+        echo "$PYTEST_STATUS"
+        exit "$PYTEST_STATUS"
+    else
+      echo "annotations passed data checks."
+      echo
+    fi
+}
+
+
+function mypy_test {
     if ! hash "$MYPY_BUILD" 2> /dev/null; then
         echo "The $MYPY_BUILD command was not found. Skipping for now."
         return
@@ -141,21 +198,11 @@ function mypy_test {
         return
     fi
 
-    echo "starting mypy test..."
-    MYPY_REPORT=$( ($MYPY_BUILD --config-file python/mypy.ini python/pyspark) 2>&1)
-    MYPY_STATUS=$?
-
-    if [ "$MYPY_STATUS" -ne 0 ]; then
-        echo "mypy checks failed:"
-        echo "$MYPY_REPORT"
-        echo "$MYPY_STATUS"
-        exit "$MYPY_STATUS"
-    else
-        echo "mypy checks passed."
-        echo
-    fi
+    mypy_annotation_test
+    mypy_data_test
 }
 
+
 function flake8_test {
     local FLAKE8_VERSION=
     local EXPECTED_FLAKE8=
diff --git a/python/pyproject.toml b/python/pyproject.toml
new file mode 100644
index 0000000..286b728
--- /dev/null
+++ b/python/pyproject.toml
@@ -0,0 +1,25 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+[tool.pytest.ini_options]
+# Pytest is used only to run mypy data tests
+python_files = "test_*.yml"
+testpaths = [
+  "pyspark/tests/typing",
+  "pyspark/sql/tests/typing",
+  "pyspark/ml/tests/typing",
+]
diff --git a/python/pyspark/ml/tests/typing/test_classification.yml b/python/pyspark/ml/tests/typing/test_classification.yml
new file mode 100644
index 0000000..a6efc76
--- /dev/null
+++ b/python/pyspark/ml/tests/typing/test_classification.yml
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: oneVsRest
+  main: |
+    from pyspark.ml.classification import (
+        OneVsRest, OneVsRestModel, LogisticRegression, LogisticRegressionModel
+    )
+
+    # Should support
+    OneVsRest(classifier=LogisticRegression())
+    OneVsRest(classifier=LogisticRegressionModel.load("/foo"))  # E: Argument "classifier" to "OneVsRest" has incompatible type "LogisticRegressionModel"; expected "Optional[Estimator[<nothing>]]"  [arg-type]
+    OneVsRest(classifier="foo")  # E: Argument "classifier" to "OneVsRest" has incompatible type "str"; expected "Optional[Estimator[<nothing>]]"  [arg-type]
+
+
+- case: fitFMClassifier
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.ml.classification import FMClassifier, FMClassificationModel
+
+    spark = SparkSession.builder.getOrCreate()
+    fm_model: FMClassificationModel = FMClassifier().fit(spark.read.parquet("/foo"))
+    fm_model.linear.toArray()
+    fm_model.factors.numRows
diff --git a/python/pyspark/ml/tests/typing/test_evaluation.yml b/python/pyspark/ml/tests/typing/test_evaluation.yml
new file mode 100644
index 0000000..e9e8f20
--- /dev/null
+++ b/python/pyspark/ml/tests/typing/test_evaluation.yml
@@ -0,0 +1,26 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: BinaryClassificationEvaluator
+  main: |
+    from pyspark.ml.evaluation import BinaryClassificationEvaluator
+
+    BinaryClassificationEvaluator().setMetricName("areaUnderROC")
+    BinaryClassificationEvaluator(metricName="areaUnderPR")
+
+    BinaryClassificationEvaluator().setMetricName("foo")  # E: Argument 1 to "setMetricName" of "BinaryClassificationEvaluator" has incompatible type "Literal['foo']"; expected "Union[Literal['areaUnderROC'], Literal['areaUnderPR']]"  [arg-type]
+    BinaryClassificationEvaluator(metricName="bar")  # E: Argument "metricName" to "BinaryClassificationEvaluator" has incompatible type "Literal['bar']"; expected "Union[Literal['areaUnderROC'], Literal['areaUnderPR']]"  [arg-type]
diff --git a/python/pyspark/ml/tests/typing/test_feature.yml b/python/pyspark/ml/tests/typing/test_feature.yml
new file mode 100644
index 0000000..3d6b090
--- /dev/null
+++ b/python/pyspark/ml/tests/typing/test_feature.yml
@@ -0,0 +1,44 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: stringIndexerOverloads
+  main: |
+    from pyspark.ml.feature import StringIndexer
+
+    # No arguments is OK
+    StringIndexer()
+
+    StringIndexer(inputCol="foo")
+    StringIndexer(outputCol="bar")
+    StringIndexer(inputCol="foo", outputCol="bar")
+
+    StringIndexer(inputCols=["foo"])
+    StringIndexer(outputCols=["bar"])
+    StringIndexer(inputCols=["foo"], outputCols=["bar"])
+
+    StringIndexer(inputCol="foo", outputCols=["bar"])
+    StringIndexer(inputCols=["foo"], outputCol="bar")
+
+  out: |
+    main:14: error: No overload variant of "StringIndexer" matches argument types "str", "List[str]"  [call-overload]
+    main:14: note: Possible overload variants:
+    main:14: note:     def StringIndexer(self, *, inputCol: Optional[str] = ..., outputCol: Optional[str] = ..., handleInvalid: str = ..., stringOrderType: str = ...) -> StringIndexer
+    main:14: note:     def StringIndexer(self, *, inputCols: Optional[List[str]] = ..., outputCols: Optional[List[str]] = ..., handleInvalid: str = ..., stringOrderType: str = ...) -> StringIndexer
+    main:15: error: No overload variant of "StringIndexer" matches argument types "List[str]", "str"  [call-overload]
+    main:15: note: Possible overload variants:
+    main:15: note:     def StringIndexer(self, *, inputCol: Optional[str] = ..., outputCol: Optional[str] = ..., handleInvalid: str = ..., stringOrderType: str = ...) -> StringIndexer
+    main:15: note:     def StringIndexer(self, *, inputCols: Optional[List[str]] = ..., outputCols: Optional[List[str]] = ..., handleInvalid: str = ..., stringOrderType: str = ...) -> StringIndexer
\ No newline at end of file
diff --git a/python/pyspark/ml/tests/typing/test_param.yml b/python/pyspark/ml/tests/typing/test_param.yml
new file mode 100644
index 0000000..0b423f0
--- /dev/null
+++ b/python/pyspark/ml/tests/typing/test_param.yml
@@ -0,0 +1,30 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: paramGenric
+  main: |
+    from pyspark.ml.param import Param, Params, TypeConverters
+
+    class Foo(Params):
+        foo = Param(Params(), "foo", "foo", TypeConverters.toInt)
+        def getFoo(self) -> int:
+            return self.getOrDefault(self.foo)
+
+    class Bar(Params):
+        bar = Param(Params(), "bar", "bar", TypeConverters.toInt)
+        def getFoo(self) -> str:
+            return self.getOrDefault(self.bar)  # E: Incompatible return value type (got "int", expected "str")  [return-value]
diff --git a/python/pyspark/ml/tests/typing/test_readable.yml b/python/pyspark/ml/tests/typing/test_readable.yml
new file mode 100644
index 0000000..772133a
--- /dev/null
+++ b/python/pyspark/ml/tests/typing/test_readable.yml
@@ -0,0 +1,28 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: readLinearSVCModel
+  main: |
+    from pyspark.ml.classification import LinearSVCModel
+
+    model1 = LinearSVCModel.load("dummy")
+    model1.coefficients.toArray()
+    model1.foo()  # E: "LinearSVCModel" has no attribute "foo"  [attr-defined]
+
+    model2 = LinearSVCModel.read().load("dummy")
+    model2.coefficients.toArray()
+    model2.foo()  # E: "LinearSVCModel" has no attribute "foo"  [attr-defined]
diff --git a/python/pyspark/ml/tests/typing/test_regression.yml b/python/pyspark/ml/tests/typing/test_regression.yml
new file mode 100644
index 0000000..b045bec
--- /dev/null
+++ b/python/pyspark/ml/tests/typing/test_regression.yml
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: loadFMRegressor
+  main: |
+    from pyspark.ml.regression import FMRegressor, FMRegressionModel
+
+    fm = FMRegressor.load("/foo")
+    fm.setMiniBatchFraction(0.1)
+
+    fm_model = FMRegressionModel.load("/bar")
+    fm_model.factors.numCols
+
+    fm_model.foo()  # E: "FMRegressionModel" has no attribute "foo"  [attr-defined]
+
+
+- case: loadLinearRegressor
+  main: |
+    from pyspark.ml.regression import LinearRegressionModel
+
+    lr_model = LinearRegressionModel.load("/foo")
+    lr_model.getLabelCol().upper()
+
+    lr_model.foo  # E: "LinearRegressionModel" has no attribute "foo"  [attr-defined]
diff --git a/python/pyspark/sql/tests/typing/test_column.yml b/python/pyspark/sql/tests/typing/test_column.yml
new file mode 100644
index 0000000..26eeb61
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_column.yml
@@ -0,0 +1,37 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: colDateTimeCompare
+  main: |
+    import datetime
+    from pyspark.sql.functions import col
+
+    today = datetime.date.today()
+    now = datetime.datetime.now()
+    a_col = col("")
+
+    a_col < today
+    a_col <= today
+    a_col == today
+    a_col >= today
+    a_col > today
+
+    a_col < now
+    a_col <= now
+    a_col == now
+    a_col >= now
+    a_col > now
diff --git a/python/pyspark/sql/tests/typing/test_dataframe.yml b/python/pyspark/sql/tests/typing/test_dataframe.yml
new file mode 100644
index 0000000..79a3bcd
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_dataframe.yml
@@ -0,0 +1,140 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: sampling
+  main: |
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.getOrCreate()
+    df = spark.range(1)
+    df.sample(1.0)
+    df.sample(0.5, 3)
+    df.sample(fraction=0.5, seed=3)
+    df.sample(withReplacement=True, fraction=0.5, seed=3)
+    df.sample(fraction=1.0)
+    df.sample(False, fraction=1.0)
+
+    # Will raise a runtime error, though not a typecheck error, as bool is
+    # duck-type compatible with float.
+    df.sample(True)
+
+    df.sample(withReplacement=False)
+
+  out: |
+    main:16: error: No overload variant of "sample" of "DataFrame" matches argument type "bool"  [call-overload]
+    main:16: note: Possible overload variants:
+    main:16: note:     def sample(self, fraction: float, seed: Optional[int] = ...) -> DataFrame
+    main:16: note:     def sample(self, withReplacement: Optional[bool], fraction: float, seed: Optional[int] = ...) -> DataFrame
+
+
+- case: selectColumns
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import col
+
+    spark = SparkSession.builder.getOrCreate()
+
+    data = [('Alice', 1)]
+    df = spark.createDataFrame(data, schema="name str, age int")
+
+    df.select(["name", "age"])
+    df.select([col("name"), col("age")])
+
+    df.select(["name", col("age")])  # E: Argument 1 to "select" of "DataFrame" has incompatible type "List[object]"; expected "Union[List[Column], List[str]]"  [arg-type]
+
+
+- case: groupBy
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import col
+
+    spark = SparkSession.builder.getOrCreate()
+
+    data = [('Alice', 1)]
+    df = spark.createDataFrame(data, schema="name str, age int")
+
+    df.groupBy(["name", "age"])
+    df.groupby(["name", "age"])
+    df.groupBy([col("name"), col("age")])
+    df.groupby([col("name"), col("age")])
+    df.groupBy(["name", col("age")])  # E: Argument 1 to "groupBy" of "DataFrame" has incompatible type "List[object]"; expected "Union[List[Column], List[str]]"  [arg-type]
+
+
+- case: rollup
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import col
+
+    spark = SparkSession.builder.getOrCreate()
+
+    data = [('Alice', 1)]
+    df = spark.createDataFrame(data, schema="name str, age int")
+
+    df.rollup(["name", "age"])
+    df.rollup([col("name"), col("age")])
+
+
+    df.rollup(["name", col("age")])  # E: Argument 1 to "rollup" of "DataFrame" has incompatible type "List[object]"; expected "Union[List[Column], List[str]]"  [arg-type]
+
+
+- case: cube
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import col
+
+    spark = SparkSession.builder.getOrCreate()
+
+    data = [('Alice', 1)]
+    df = spark.createDataFrame(data, schema="name str, age int")
+
+    df.cube(["name", "age"])
+    df.cube([col("name"), col("age")])
+
+
+    df.cube(["name", col("age")])  # E: Argument 1 to "cube" of "DataFrame" has incompatible type "List[object]"; expected "Union[List[Column], List[str]]"  [arg-type]
+
+
+- case: dropColumns
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import lit, col
+
+    spark = SparkSession.builder.getOrCreate()
+    df = spark.range(1)
+    df.drop("id")
+    df.drop("id", "foo")
+    df.drop(df.id)
+
+    df.drop(col("id"), col("foo"))
+
+  out: |
+    main:10: error: No overload variant of "drop" of "DataFrame" matches argument types "Column", "Column"  [call-overload]
+    main:10: note: Possible overload variant:
+    main:10: note:     def drop(self, *cols: str) -> DataFrame
+    main:10: note:     <1 more non-matching overload not shown>
+
+
+- case: fillNullValues
+  main: |
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.getOrCreate()
+    df = spark.createDataFrame([(1,2)], schema=("id1", "id2"))
+
+    df.fillna(value=1, subset="id1")
+    df.fillna(value=1, subset=("id1", "id2"))
+    df.fillna(value=1, subset=["id1"])
diff --git a/python/pyspark/sql/tests/typing/test_functions.yml b/python/pyspark/sql/tests/typing/test_functions.yml
new file mode 100644
index 0000000..70f6fd9
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_functions.yml
@@ -0,0 +1,90 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: varargFunctionsOverloads
+  expect_fail: true  # TODO: Remove once SPARK-37085 is resolved
+  main: |
+    from pyspark.sql.functions import (
+      array,
+      col,
+      create_map,
+      map_concat,
+      struct,
+    )
+
+    array(col("foo"), col("bar"))
+    array([col("foo"), col("bar")])
+    array("foo", "bar")
+    array(["foo", "bar"])
+
+    create_map(col("foo"), col("bar"))
+    create_map([col("foo"), col("bar")])
+    create_map("foo", "bar")
+    create_map(["foo", "bar"])
+
+    map_concat(col("foo"), col("bar"))
+    map_concat([col("foo"), col("bar")])
+    map_concat("foo", "bar")
+    map_concat(["foo", "bar"])
+
+    struct(col("foo"), col("bar"))
+    struct([col("foo"), col("bar")])
+    struct("foo", "bar")
+    struct(["foo", "bar"])
+
+    array([col("foo")], [col("bar")])
+    create_map([col("foo")], [col("bar")])
+    map_concat([col("foo")], [col("bar")])
+    struct(["foo"], ["bar"])
+    array(["foo"], ["bar"])
+    create_map(["foo"], ["bar"])
+    map_concat(["foo"], ["bar"])
+    struct(["foo"], ["bar"])
+
+  out: |
+    main.py:29: error: No overload variant of "array" matches argument types "List[Column]", "List[Column]"
+    main.py:29: note: Possible overload variant:
+    main.py:29: note:     def array(*cols: Union[Column, str]) -> Column
+    main.py:29: note:     <1 more non-matching overload not shown>
+    main.py:30: error: No overload variant of "create_map" matches argument types "List[Column]", "List[Column]"
+    main.py:30: note: Possible overload variant:
+    main.py:30: note:     def create_map(*cols: Union[Column, str]) -> Column
+    main.py:30: note:     <1 more non-matching overload not shown>
+    main.py:31: error: No overload variant of "map_concat" matches argument types "List[Column]", "List[Column]"
+    main.py:31: note: Possible overload variant:
+    main.py:31: note:     def map_concat(*cols: Union[Column, str]) -> Column
+    main.py:31: note:     <1 more non-matching overload not shown>
+    main.py:32: error: No overload variant of "struct" matches argument types "List[str]", "List[str]"
+    main.py:32: note: Possible overload variant:
+    main.py:32: note:     def struct(*cols: Union[Column, str]) -> Column
+    main.py:32: note:     <1 more non-matching overload not shown>
+    main.py:33: error: No overload variant of "array" matches argument types "List[str]", "List[str]"
+    main.py:33: note: Possible overload variant:
+    main.py:33: note:     def array(*cols: Union[Column, str]) -> Column
+    main.py:33: note:     <1 more non-matching overload not shown>
+    main.py:34: error: No overload variant of "create_map" matches argument types "List[str]", "List[str]"
+    main.py:34: note: Possible overload variant:
+    main.py:34: note:     def create_map(*cols: Union[Column, str]) -> Column
+    main.py:34: note:     <1 more non-matching overload not shown>
+    main.py:35: error: No overload variant of "map_concat" matches argument types "List[str]", "List[str]"
+    main.py:35: note: Possible overload variant:
+    main.py:35: note:     def map_concat(*cols: Union[Column, str]) -> Column
+    main.py:35: note:     <1 more non-matching overload not shown>
+    main.py:36: error: No overload variant of "struct" matches argument types "List[str]", "List[str]"
+    main.py:36: note: Possible overload variant:
+    main.py:36: note:     def struct(*cols: Union[Column, str]) -> Column
+    main.py:36: note:     <1 more non-matching overload not shown>
diff --git a/python/pyspark/sql/tests/typing/test_pandas_compatibility.yml b/python/pyspark/sql/tests/typing/test_pandas_compatibility.yml
new file mode 100644
index 0000000..00d44a7
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_pandas_compatibility.yml
@@ -0,0 +1,35 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: pandasDataFrameProtocol
+  main: |
+    from pyspark.sql.pandas._typing import PandasDataFrame
+
+    df = PandasDataFrame({"a":  1})
+    df.groupby("a")
+
+    df.foo()  # E: "DataFrameLike" has no attribute "foo"  [attr-defined]
+
+
+- case: PandasSeriesProtocol
+  main: |
+    from pyspark.sql.pandas._typing import PandasSeries
+
+    series = PandasSeries([1, 2, 3])
+    series.nsmallest(3)
+
+    series.foo()  # E: "SeriesLike" has no attribute "foo"  [attr-defined]
diff --git a/python/pyspark/sql/tests/typing/test_readwriter.yml b/python/pyspark/sql/tests/typing/test_readwriter.yml
new file mode 100644
index 0000000..2ce3637
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_readwriter.yml
@@ -0,0 +1,45 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: readWriterOptions
+  main: |
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.getOrCreate()
+
+    spark.read.option("foo", 1)
+    spark.createDataFrame([(1, 2)], ["foo", "bar"]).write.option("bar", True)
+
+    spark.read.load(foo=True)
+
+    spark.read.load(foo=["a"])  # E: Argument "foo" to "load" of "DataFrameReader" has incompatible type "List[str]"; expected "Union[bool, float, int, str, None]"  [arg-type]
+    spark.read.option("foo", (1, ))  # E: Argument 2 to "option" of "DataFrameReader" has incompatible type "Tuple[int]"; expected "Union[bool, float, int, str, None]"  [arg-type]
+
+
+- case: readStreamOptions
+  main: |
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.getOrCreate()
+
+    spark.read.option("foo", True).option("foo", 1).option("foo", 1.0).option("foo", "1").option("foo", None)
+    spark.readStream.option("foo", True).option("foo", 1).option("foo", 1.0).option("foo", "1").option("foo", None)
+
+    spark.read.options(foo=True, bar=1).options(foo=1.0, bar="1", baz=None)
+    spark.readStream.options(foo=True, bar=1).options(foo=1.0, bar="1", baz=None)
+
+    spark.readStream.load(foo=True)
diff --git a/python/pyspark/sql/tests/typing/test_session.yml b/python/pyspark/sql/tests/typing/test_session.yml
new file mode 100644
index 0000000..f06e79e
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_session.yml
@@ -0,0 +1,113 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: createDataFrameStructsValid
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+    spark = SparkSession.builder.getOrCreate()
+
+    data = [('Alice', 1)]
+    schema = StructType([
+        StructField("name", StringType(), True),
+        StructField("age", IntegerType(), True)
+    ])
+
+    # Valid structs
+    spark.createDataFrame(data)
+    spark.createDataFrame(data, samplingRatio=0.1)
+    spark.createDataFrame(data, ("name", "age"))
+    spark.createDataFrame(data, schema)
+    spark.createDataFrame(data, "name string, age integer")
+    spark.createDataFrame([(1, ("foo", "bar"))], ("_1", "_2"))
+    spark.createDataFrame(data, ("name", "age"), samplingRatio=0.1)  # type: ignore
+
+
+- case: createDataFrameScalarsValid
+  main: |
+
+    from pyspark.sql import SparkSession
+    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+    spark = SparkSession.builder.getOrCreate()
+
+    # Scalars
+    spark.createDataFrame([1, 2, 3], IntegerType())
+    spark.createDataFrame(["foo", "bar"], "string")
+
+
+- case: createDataFrameScalarsInvalid
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+    spark = SparkSession.builder.getOrCreate()
+
+    schema = StructType([
+        StructField("name", StringType(), True),
+        StructField("age", IntegerType(), True)
+    ])
+
+    # Invalid - scalars require schema
+    spark.createDataFrame(["foo", "bar"]) # E: Value of type variable "RowLike" of "createDataFrame" of "SparkSession" cannot be "str"  [type-var]
+
+    # Invalid - data has to match schema (either product -> struct or scalar -> atomic)
+    spark.createDataFrame([1, 2, 3], schema) # E: Value of type variable "RowLike" of "createDataFrame" of "SparkSession" cannot be "int"  [type-var]
+
+
+- case: createDataFrameStructsInvalid
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.types import StructType, StructField, StringType, IntegerType
+
+    spark = SparkSession.builder.getOrCreate()
+
+    data = [('Alice', 1)]
+
+    schema = StructType([
+        StructField("name", StringType(), True),
+        StructField("age", IntegerType(), True)
+    ])
+
+    # Invalid - product should have StructType schema
+    spark.createDataFrame(data, IntegerType())
+
+    # This shouldn't type check, though it is technically speaking valid
+    # because samplingRatio is ignored
+    spark.createDataFrame(data, schema, samplingRatio=0.1)
+
+  out: |
+    main:14: error: Argument 1 to "createDataFrame" of "SparkSession" has incompatible type "List[Tuple[str, int]]"; expected "Union[RDD[Union[Union[datetime, date], Union[bool, float, int, str], Decimal]], Iterable[Union[Union[datetime, date], Union[bool, float, int, str], Decimal]]]"  [arg-type]
+    main:18: error: No overload variant of "createDataFrame" of "SparkSession" matches argument types "List[Tuple[str, int]]", "StructType", "float"  [call-overload]
+    main:18: note: Possible overload variants:
+    main:18: note:     def [RowLike in (List[Any], Tuple[Any, ...], Row)] createDataFrame(self, data: Union[RDD[RowLike], Iterable[RowLike]], samplingRatio: Optional[float] = ...) -> DataFrame
+    main:18: note:     def [RowLike in (List[Any], Tuple[Any, ...], Row)] createDataFrame(self, data: Union[RDD[RowLike], Iterable[RowLike]], schema: Union[List[str], Tuple[str, ...]] = ..., verifySchema: bool = ...) -> DataFrame
+    main:18: note:     <4 more similar overloads not shown, out of 6 total overloads>
+
+
+- case: createDataFrameFromEmptyRdd
+  main: |
+    from pyspark.sql import SparkSession
+    from pyspark.sql.types import StructType
+
+    spark = SparkSession.builder.getOrCreate()
+
+    spark.createDataFrame(
+        spark.sparkContext.emptyRDD(),
+        schema=StructType(),
+    )
diff --git a/python/pyspark/sql/tests/typing/test_udf.yml b/python/pyspark/sql/tests/typing/test_udf.yml
new file mode 100644
index 0000000..3860830
--- /dev/null
+++ b/python/pyspark/sql/tests/typing/test_udf.yml
@@ -0,0 +1,170 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: scalarUDF
+  main: |
+    from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
+    import pandas.core.series  # type: ignore[import]
+    import pandas.core.frame  # type: ignore[import]
+
+    @pandas_udf("str", PandasUDFType.SCALAR)
+    def f(x: pandas.core.series.Series) -> pandas.core.series.Series:
+        return x
+
+    @pandas_udf("str", PandasUDFType.SCALAR)
+    def g(x: pandas.core.series.Series, y: pandas.core.series.Series) -> pandas.core.series.Series:
+        return x
+
+    @pandas_udf("str", PandasUDFType.SCALAR)
+    def h(*xs: pandas.core.series.Series) -> pandas.core.series.Series:
+        return xs[0]
+
+    @pandas_udf("str", PandasUDFType.SCALAR)
+    def k(x: pandas.core.frame.DataFrame, y: pandas.core.series.Series) -> pandas.core.series.Series:
+        return x
+
+    pandas_udf(lambda x: x, "str", PandasUDFType.SCALAR)
+    pandas_udf(lambda x, y: x, "str", PandasUDFType.SCALAR)
+    pandas_udf(lambda *xs: xs[0], "str", PandasUDFType.SCALAR)
+
+
+- case: scalarIterUDF
+  main: |
+    from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
+    from pyspark.sql.types import IntegerType
+    import pandas.core.series  # type: ignore[import]
+    from typing import Iterable
+
+    @pandas_udf(IntegerType(), PandasUDFType.SCALAR_ITER)
+    def f(xs: pandas.core.series.Series) -> Iterable[pandas.core.series.Series]:
+        for x in xs:
+            yield x + 1
+
+
+- case: groupedMapUdf
+  main: |
+    from typing import Any
+
+    from pyspark.sql.session import SparkSession
+    from pyspark.sql.types import StructField, StructType, LongType
+    from pyspark.sql.pandas.functions import pandas_udf, PandasUDFType
+    import pandas.core.frame  # type: ignore[import]
+
+    @pandas_udf("id long", PandasUDFType.GROUPED_MAP)
+    def f(pdf: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame:
+       return pdf
+
+    spark = SparkSession.builder.getOrCreate()
+
+    dfg = spark.range(1).groupBy("id")
+    dfg.apply(f)
+
+    @pandas_udf("id long", PandasUDFType.GROUPED_MAP)
+    def g(key: Any, pdf: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame:
+       return pdf
+
+    dfg.apply(g)
+
+
+    def h(pdf: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame:
+       return pdf
+
+    dfg.applyInPandas(h, "id long")
+    dfg.applyInPandas(h, StructType([StructField("id", LongType())]))
+
+
+- case: groupedAggUDF
+  main: |
+    # Let's keep this one to make sure compatibility imports work
+    from pyspark.sql.functions import pandas_udf, PandasUDFType
+    from pyspark.sql.types import IntegerType
+    import pandas.core.series  # type: ignore[import]
+
+    @pandas_udf(IntegerType(), PandasUDFType.GROUPED_AGG)
+    def f(x: pandas.core.series.Series) -> int:
+        return 42
+
+    @pandas_udf("int", PandasUDFType.GROUPED_AGG)
+    def g(x: pandas.core.series.Series, y: pandas.core.series.Series) -> int:
+        return 42
+
+    @pandas_udf("int", PandasUDFType.GROUPED_AGG)
+    def h(*xs: pandas.core.series.Series) -> int:
+        return 42
+
+    pandas_udf(lambda x: 42, "str", PandasUDFType.GROUPED_AGG)
+    pandas_udf(lambda x, y: 42, "str", PandasUDFType.GROUPED_AGG)
+    pandas_udf(lambda *xs: 42, "str", PandasUDFType.GROUPED_AGG)
+
+
+- case: mapIterUdf
+  main: |
+    from pyspark.sql.session import SparkSession
+    from typing import Iterable
+    import pandas.core.frame  # type: ignore[import]
+
+    spark = SparkSession.builder.getOrCreate()
+
+    def f(batch_iter: Iterable[pandas.core.frame.DataFrame]) -> Iterable[pandas.core.frame.DataFrame]:
+        for pdf in batch_iter:
+            yield pdf[pdf.id == 1]
+
+    spark.range(1).mapInPandas(f, "id long").show()
+
+
+- case: legacyUDF
+  main: |
+    from pyspark.sql.functions import udf
+    from pyspark.sql.types import IntegerType
+
+    udf(lambda x: x, "string")
+
+    udf(lambda x: x)
+
+    @udf("string")
+    def f(x: str) -> str:
+        return x
+
+    @udf(returnType="string")
+    def g(x: str) -> str:
+        return x
+
+    @udf(returnType=IntegerType())
+    def h(x: int) -> int:
+        return x
+
+    @udf
+    def i(x: str) -> str:
+        return x
+
+
+- case: cogroupedAggUdf
+  main: |
+    from pyspark.sql.session import SparkSession
+    import pandas.core.frame  # type: ignore[import]
+    from pyspark.sql.types import StructType, StructField, LongType
+
+    spark = SparkSession.builder.getOrCreate()
+
+    dfg1 = spark.range(1).groupBy("id")
+    dfg2 = spark.range(1).groupBy("id")
+
+    def f(x: pandas.core.frame.DataFrame, y: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame:
+        return x
+
+    dfg1.cogroup(dfg2).applyInPandas(f, "id int")
+    dfg1.cogroup(dfg2).applyInPandas(f, StructType([StructField("id", LongType())]))
diff --git a/python/pyspark/tests/typing/test_context.yml b/python/pyspark/tests/typing/test_context.yml
new file mode 100644
index 0000000..1217651
--- /dev/null
+++ b/python/pyspark/tests/typing/test_context.yml
@@ -0,0 +1,21 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: contextInitalization
+  main: |
+    from pyspark import SparkContext
+    sc: SparkContext = SparkContext()
diff --git a/python/pyspark/tests/typing/test_core.yml b/python/pyspark/tests/typing/test_core.yml
new file mode 100644
index 0000000..ff58613
--- /dev/null
+++ b/python/pyspark/tests/typing/test_core.yml
@@ -0,0 +1,20 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: coreImports
+  main: |
+    from pyspark import keyword_only, Row, SQLContext
diff --git a/python/pyspark/tests/typing/test_rdd.yml b/python/pyspark/tests/typing/test_rdd.yml
new file mode 100644
index 0000000..749ad53
--- /dev/null
+++ b/python/pyspark/tests/typing/test_rdd.yml
@@ -0,0 +1,62 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: toDF
+  main: |
+    from pyspark.sql.types import (
+      IntegerType,
+      Row,
+      StructType,
+      StringType,
+      StructField,
+    )
+    from collections import namedtuple
+    from pyspark.sql import SparkSession
+
+    spark = SparkSession.builder.getOrCreate()
+    sc = spark.sparkContext
+
+    struct = StructType([
+        StructField("a", IntegerType()),
+        StructField("b", StringType())
+    ])
+
+    AB = namedtuple("AB", ["a", "b"])
+
+    rdd_row = sc.parallelize([Row(a=1, b="foo")])
+    rdd_row.toDF()
+    rdd_row.toDF(sampleRatio=0.4)
+    rdd_row.toDF(["a", "b"], sampleRatio=0.4)
+    rdd_row.toDF(struct)
+
+    rdd_tuple = sc.parallelize([(1, "foo")])
+    rdd_tuple.toDF()
+    rdd_tuple.toDF(sampleRatio=0.4)
+    rdd_tuple.toDF(["a", "b"], sampleRatio=0.4)
+    rdd_tuple.toDF(struct)
+
+    rdd_list = sc.parallelize([[1, "foo"]])
+    rdd_list.toDF()
+    rdd_list.toDF(sampleRatio=0.4)
+    rdd_list.toDF(["a", "b"], sampleRatio=0.4)
+    rdd_list.toDF(struct)
+
+    rdd_named_tuple = sc.parallelize([AB(1, "foo")])
+    rdd_named_tuple.toDF()
+    rdd_named_tuple.toDF(sampleRatio=0.4)
+    rdd_named_tuple.toDF(["a", "b"], sampleRatio=0.4)
+    rdd_named_tuple.toDF(struct)
diff --git a/python/pyspark/tests/typing/test_resultiterable.yml b/python/pyspark/tests/typing/test_resultiterable.yml
new file mode 100644
index 0000000..4261a29
--- /dev/null
+++ b/python/pyspark/tests/typing/test_resultiterable.yml
@@ -0,0 +1,22 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+- case: resultIterable
+  main: |
+    from pyspark.resultiterable import ResultIterable
+
+    ResultIterable([])

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
