zero323 commented on a change in pull request #34273:
URL: https://github.com/apache/spark/pull/34273#discussion_r727940837



##########
File path: dev/lint-python
##########
@@ -124,39 +124,67 @@ function pycodestyle_test {
     fi
 }
 
-function mypy_test {
+
+function mypy_annotation_test {
     local MYPY_REPORT=
     local MYPY_STATUS=
 
-    if ! hash "$MYPY_BUILD" 2> /dev/null; then
-        echo "The $MYPY_BUILD command was not found. Skipping for now."
-        return
+    echo "starting mypy annotations test..."
+    MYPY_REPORT=$( ($MYPY_BUILD --config-file python/mypy.ini python/pyspark) 2>&1)
+    MYPY_STATUS=$?
+
+    if [ "$MYPY_STATUS" -ne 0 ]; then
+        echo "annotations failed mypy checks:"
+        echo "$MYPY_REPORT"
+        echo "$MYPY_STATUS"
+        exit "$MYPY_STATUS"
+    else
+        echo "annotations passed mypy checks."
+        echo
     fi
+}
 
-    _MYPY_VERSION=($($MYPY_BUILD --version))
-    MYPY_VERSION="${_MYPY_VERSION[1]}"
-    EXPECTED_MYPY="$(satisfies_min_version $MYPY_VERSION $MINIMUM_MYPY)"
 
-    if [[ "$EXPECTED_MYPY" == "False" ]]; then
-        echo "The minimum mypy version needs to be $MINIMUM_MYPY. Your current 
version is $MYPY_VERSION. Skipping for now."
-        return
-    fi
+function mypy_examples_test {
+    local MYPY_REPORT=
+    local MYPY_STATUS=
+
+    echo "starting mypy examples test..."
+
+    MYPY_REPORT=$( (MYPYPATH=python $MYPY_BUILD \
+      --allow-untyped-defs \
+      --config-file python/mypy.ini \
+      --exclude "mllib/*" \

Review comment:
       At the moment, `mllib` examples yield a number of errors, which require adjustments of the actual annotations. I'll resolve this separately and drop the exclude later.

##########
File path: examples/src/main/python/als.py
##########
@@ -87,15 +87,15 @@ def update(i, mat, ratings):
     for i in range(ITERATIONS):
         ms = sc.parallelize(range(M), partitions) \
                .map(lambda x: update(x, usb.value, Rb.value)) \
-               .collect()
+               .collect()  # type: ignore[assignment]
         # collect() returns a list, so array ends up being
         # a 3-d array, we take the first 2 dims for the matrix
-        ms = matrix(np.array(ms)[:, :, 0])
+        ms = matrix(np.array(ms)[:, :, 0])  # type: ignore[assignment]
         msb = sc.broadcast(ms)
 
         us = sc.parallelize(range(U), partitions) \
                .map(lambda x: update(x, msb.value, Rb.value.T)) \
-               .collect()
+               .collect()  # type: ignore[assignment]

Review comment:
       These ignores are required because we keep rebinding the same names to values of different types.
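   
   For context, a minimal sketch of the pattern mypy objects to (made-up values, not from the PR): a name first bound to a plain list is later rebound to a `numpy.matrix`, which trips the `assignment` check unless the line is ignored.
   
   ```python
   import numpy as np
   from numpy import matrix
   
   # ms starts out as a plain nested list, like the result of collect() ...
   ms = [[1.0, 2.0], [3.0, 4.0]]
   
   # ... and is then rebound to a numpy matrix. mypy infers List[List[float]]
   # from the first assignment, so the rebinding is reported as
   # "Incompatible types in assignment" unless silenced per line.
   ms = matrix(np.array(ms))  # type: ignore[assignment]
   ```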

##########
File path: examples/src/main/python/ml/__init__.py
##########
@@ -0,0 +1,16 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#

Review comment:
       `__init__.py` files were added to the examples directories so that mypy treats them as valid Python packages. Otherwise, mypy would collect bare module names and fail due to duplicate names.
   
   Alternatively, we could run the checks in a loop for each directory, but this seems cleaner.

##########
File path: examples/src/main/python/ml/chi_square_test_example.py
##########
@@ -42,9 +42,9 @@
     df = spark.createDataFrame(data, ["label", "features"])
 
     r = ChiSquareTest.test(df, "features", "label").head()
-    print("pValues: " + str(r.pValues))
-    print("degreesOfFreedom: " + str(r.degreesOfFreedom))
-    print("statistics: " + str(r.statistics))
+    print("pValues: " + str(r.pValues))  # type: ignore[union-attr]
+    print("degreesOfFreedom: " + str(r.degreesOfFreedom))  # type: 
ignore[union-attr]
+    print("statistics: " + str(r.statistics))   # type: ignore[union-attr]

Review comment:
       This is required because `DataFrame.head` has a rather unfriendly signature.
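   
   For illustration, a short sketch of the two ways around this, assuming the no-argument `head()` is annotated as returning `Optional[Row]` (it yields `None` on an empty DataFrame): keep the per-line ignore as in the example, or narrow the result explicitly.
   
   ```python
   from pyspark.sql import Row, SparkSession
   
   spark = SparkSession.builder.master("local[1]").getOrCreate()
   df = spark.createDataFrame([Row(pValues=0.05)])
   
   r = df.head()  # typed as possibly None, so plain attribute access is flagged
   print(r.pValues)  # type: ignore[union-attr]
   
   # Alternative: narrow the Optional before accessing attributes.
   assert r is not None
   print(r.pValues)
   ```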

##########
File path: examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py
##########
@@ -87,7 +87,7 @@
     windowedCounts = words.groupBy(
         window(words.timestamp, windowDuration, slideDuration),
         words.word
-    ).count().orderBy('window')
+    ).count().orderBy('window')  # type: ignore[arg-type]

Review comment:
       During tests with `mypy 0.920+dev.5f7eb8161bdeefcc526705a48b03115aa1e83a14` I experienced non-deterministic problems resulting in
   
   ```
   examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py:90: error: Argument 1 has incompatible type "str"; expected "DataFrame"  [arg-type]
   Found 1 error in 1 file (checked 8 source files)
   ```
   
   It seems to happen because `orderBy` is created as an alias for `sort`
   
   https://github.com/apache/spark/blob/9290924054f8e4e24afe3c6796465c4cf60e3a2b/python/pyspark/sql/dataframe.py#L1649
   
   and in certain conditions (?) mypy treats it as a static method or classmethod (I could be wrong).
   
   I couldn't reproduce it on synthetic examples, though.
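   
   For reference, the aliasing pattern the link points to looks roughly like this (a simplified sketch, not the actual pyspark source, and it does not reproduce the error on its own):
   
   ```python
   class DataFrame:
       def sort(self, *cols: str) -> "DataFrame":
           # real implementation delegates to the JVM; irrelevant for the sketch
           return self
   
       # orderBy is not defined as a method but created by assignment at class
       # scope, as in pyspark's dataframe.py. If the checker ever resolves the
       # alias as the plain function instead of a bound method, a call such as
       # df.orderBy("window") matches "window" against the "self: DataFrame"
       # parameter, which is consistent with the reported arg-type error.
       orderBy = sort
   ```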

##########
File path: examples/src/main/python/streaming/network_wordjoinsentiments.py
##########
@@ -50,10 +51,17 @@ def print_happiest_words(rdd):
     sc = SparkContext(appName="PythonStreamingNetworkWordJoinSentiments")
     ssc = StreamingContext(sc, 5)
 
+    def line_to_tuple(line: str) -> Tuple[str, str]:
+        try:
+            k, v = line.split(" ")
+            return k, v
+        except ValueError:
+            return "", ""
+
     # Read in the word-sentiment list and create a static RDD from it
     word_sentiments_file_path = "data/streaming/AFINN-111.txt"
     word_sentiments = ssc.sparkContext.textFile(word_sentiments_file_path) \
-        .map(lambda line: tuple(line.split("\t")))
+        .map(line_to_tuple)

Review comment:
       I am really not sure how to handle this.
   
   ```python
   .map(lambda line: tuple(line.split("\t")))
   ```
   returns `RDD[Tuple[str, ...]]`, which would cause a cascade of errors later.
   
   We could exclude this file from the type check, add casts (see the sketch below), or rewrite the code to:
   
   ```python
   .map(lambda line: line.split("\t"))
   .map(lambda xs: (xs[0], xs[1]))
   ```
   
   but this felt like the least intrusive and cleanest approach.
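   
   For comparison, the cast-based alternative mentioned above would look roughly like this (a sketch, not part of the change):
   
   ```python
   from typing import Tuple, cast
   
   
   def line_to_tuple(line: str) -> Tuple[str, str]:
       # tuple() over split() is inferred as Tuple[str, ...]; the cast asserts
       # the fixed length for mypy but performs no runtime validation, unlike
       # the try/except version used in the diff.
       return cast(Tuple[str, str], tuple(line.split("\t")))
   ```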




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


