[spark] branch branch-3.0 updated: [SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests

huaxingao Thu, 06 Aug 2020 14:02:04 -0700

This is an automated email from the ASF dual-hosted git repository.

huaxingao pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 30c3a50  [SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests
30c3a50 is described below

commit 30c3a502667bfa1feaf2230b4fc4cc2d36d9b85a
Author: Huaxin Gao <[email protected]>
AuthorDate: Thu Aug 6 13:54:15 2020 -0700

    [SPARK-32506][TESTS] Flaky test: StreamingLinearRegressionWithTests
    
    ### What changes were proposed in this pull request?
    The test creates 10 batches of data to train the model and expects the error on the test data to improve as the model is trained. If the difference between the 2nd error and the 10th error is not greater than 2, the assertion fails:
    ```
    FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
    Test that error on test data improves as model is trained.
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction
        eventually(condition, timeout=180.0)
      File "/home/runner/work/spark/spark/python/pyspark/testing/utils.py", line 81, in eventually
        lastValue = condition()
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition
        self.assertGreater(errors[1] - errors[-1], 2)
    AssertionError: 1.672640157855923 not greater than 2
    ```
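The traceback shows the assertion running inside a `condition` callback that is handed to `eventually` with a timeout. A minimal sketch of that polling pattern follows. It is not the actual `pyspark.testing.utils.eventually` implementation: `eventually_sketch`, its `interval` parameter, and the `arriving` values are hypothetical and only illustrate why an assertion can fail on early polls and still pass later.

```python
import time


def eventually_sketch(condition, timeout=5.0, interval=0.01):
    """Call `condition` until it stops raising AssertionError or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return condition()
        except AssertionError:
            # Out of time: surface the most recent failure to the caller.
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# Hypothetical stand-in for the streaming job: one more error value
# "arrives" each time the condition is polled.
arriving = [4.5, 4.9, 3.0, 1.9, 0.9]
errors = []


def condition():
    if len(errors) < len(arriving):
        errors.append(arriving[len(errors)])
    # Same shape as the failing assertion: 2nd error minus latest error.
    assert len(errors) >= 2 and errors[1] - errors[-1] > 2, "margin too small"
    return True


result = eventually_sketch(condition)
```

In this sketch the condition fails while the latest error is still close to the 2nd one and passes once the gap exceeds 2. If the gap never exceeds 2 before the timeout, the last `AssertionError` is what gets reported, as in the failure above.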
    I saw this quite a few times on Jenkins but was not able to reproduce it locally. These are the ten errors I got:
    ```
    4.517395047937127
    4.894265404350079
    3.0392090466559876
    1.8786361640757654
    0.8973106042078115
    0.3715780507684368
    0.20815690742907672
    0.17333033743125845
    0.15686783249863873
    0.12584413600569616
    ```
    I am thinking of using 15 batches of data instead of 10, so the model can be trained for longer. Hopefully the 2nd error minus the 15th error will always be larger than 2 on Jenkins. These are the 15 errors I got locally:
    ```
    4.517395047937127
    4.894265404350079
    3.0392090466559876
    1.8786361640757658
    0.8973106042078115
    0.3715780507684368
    0.20815690742907672
    0.17333033743125845
    0.15686783249863873
    0.12584413600569616
    0.11883853835108477
    0.09400261862100823
    0.08887491447353497
    0.05984929624986607
    0.07583948141520978
    ```
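As a rough check of the numbers above, the quantity the test asserts on (`errors[1] - errors[-1]`) can be computed for both local runs. This is only a sketch using the local values listed in this message. The `margin` helper and variable names are illustrative and not part of the test, and the Jenkins values are not known.

```python
# Local values copied from the lists above.
second_error = 4.894265404350079      # errors[1] in both runs
tenth_error = 0.12584413600569616     # errors[-1] with 10 batches
fifteenth_error = 0.07583948141520978  # errors[-1] with 15 batches

THRESHOLD = 2  # the test asserts errors[1] - errors[-1] > 2


def margin(second, last):
    """Quantity compared against the threshold in the test."""
    return second - last


margin_10 = margin(second_error, tenth_error)
margin_15 = margin(second_error, fifteenth_error)

print(f"10 batches: {margin_10:.4f}")  # about 4.7684
print(f"15 batches: {margin_15:.4f}")  # about 4.8184
print(f"gain from 5 extra batches: {margin_15 - margin_10:.4f}")  # about 0.0500
```

Both local runs clear the threshold comfortably. The extra five batches add about 0.05 to the margin locally, so how much they help on Jenkins depends on how far training has progressed there.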
    
    ### Why are the changes needed?
    Fix flaky test
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manually tested
    
    Closes #29380 from huaxingao/flaky_test.
    
    Authored-by: Huaxin Gao <[email protected]>
    Signed-off-by: Huaxin Gao <[email protected]>
    (cherry picked from commit 75c2c53e931187912a92e0b52dae0f772fa970e3)
    Signed-off-by: Huaxin Gao <[email protected]>
---
 python/pyspark/mllib/tests/test_streaming_algorithms.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/mllib/tests/test_streaming_algorithms.py b/python/pyspark/mllib/tests/test_streaming_algorithms.py
index 2f35e07..5818a7c 100644
--- a/python/pyspark/mllib/tests/test_streaming_algorithms.py
+++ b/python/pyspark/mllib/tests/test_streaming_algorithms.py
@@ -434,9 +434,9 @@ class StreamingLinearRegressionWithTests(MLLibStreamingTestCase):
         slr = StreamingLinearRegressionWithSGD(stepSize=0.2, numIterations=25)
         slr.setInitialWeights([0.0])
 
-        # Create ten batches with 100 sample points in each.
+        # Create fifteen batches with 100 sample points in each.
         batches = []
-        for i in range(10):
+        for i in range(15):
             batch = LinearDataGenerator.generateLinearInput(
                 0.0, [10.0], [0.0], [1.0 / 3.0], 100, 42 + i, 0.1)
             batches.append(self.sc.parallelize(batch))


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
