[spark] branch branch-2.4 updated: [SPARK-26646][TEST][PYSPARK][2.4] Fix flaky test: pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

dongjoon Sat, 17 Oct 2020 16:38:57 -0700

This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 2e72b01  [SPARK-26646][TEST][PYSPARK][2.4] Fix flaky test: pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
2e72b01 is described below

commit 2e72b0110c0d962a7997fddb2ef08b6613f3d338
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Sat Oct 17 16:31:42 2020 -0700

    [SPARK-26646][TEST][PYSPARK][2.4] Fix flaky test: pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
    
    ### What changes were proposed in this pull request?
    
    This is a backport of SPARK-26646 to branch-2.4 to fix a flaky test in that branch.
    
    ### Why are the changes needed?
    
    The test pyspark.mllib.tests.StreamingLogisticRegressionWithSGDTests.test_training_and_prediction is sometimes flaky.
    
    ```
    Traceback (most recent call last):
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests.py", line 1492, in test_training_and_prediction
        self._eventually(condition, timeout=180.0)
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests.py", line 133, in _eventually
        lastValue = condition()
      File "/home/runner/work/spark/spark/python/pyspark/mllib/tests.py", line 1487, in condition
        self.assertGreater(errors[1] - errors[-1], 0.3)
    AssertionError: -0.07000000000000006 not greater than 0.3
    ```
    
    The predict stream can be consumed to the end before the input stream. When that happens, the model improves less than expected and the test fails. This patch increases the number of batches in the streams from 20 to 40. This won't increase the test time because there is a timeout.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test
    
    Closes #30078 from viirya/SPARK-26646-2.4.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 python/pyspark/mllib/tests.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/python/pyspark/mllib/tests.py b/python/pyspark/mllib/tests.py
index ec9497c..a3df358 100644
--- a/python/pyspark/mllib/tests.py
+++ b/python/pyspark/mllib/tests.py
@@ -1459,7 +1459,7 @@ class StreamingLogisticRegressionWithSGDTests(MLLibStreamingTestCase):
         """Test that the model improves on toy data with no. of batches"""
         input_batches = [
             self.sc.parallelize(self.generateLogisticInput(0, 1.5, 100, 42 + i))
-            for i in range(20)]
+            for i in range(40)]
         predict_batches = [
             b.map(lambda lp: (lp.label, lp.features)) for b in input_batches]
 
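
The traceback in the commit message shows a `condition` callable being polled by an `_eventually` helper with `timeout=180.0`, and an assertion that `errors[1] - errors[-1]` exceeds 0.3. The following is a minimal sketch of that polling pattern, not Spark's actual code: `eventually`, `make_condition`, `total_batches`, `min_improvement`, and `interval` are hypothetical names, and the early-return rule is an assumption made for illustration. It shows how a poller of this kind can return as soon as enough improvement is observed, so that adding batches mainly gives the model more chances to improve before the final check.

```python
import time


def eventually(condition, timeout=5.0, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; not the helper from pyspark's tests.
    """
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = condition()
        if last is True:
            return  # succeed early, without waiting for the timeout
        time.sleep(interval)
    raise AssertionError(
        "condition not met within %.2fs; last value: %r" % (timeout, last))


def make_condition(errors, total_batches, min_improvement=0.3):
    """Build a condition over `errors`, a list of per-batch error rates.

    Assumption: the improvement is measured as errors[1] - errors[-1],
    mirroring the assertion shown in the traceback.
    """
    def condition():
        improvement = errors[1] - errors[-1] if len(errors) >= 2 else None
        if len(errors) == total_batches and improvement <= min_improvement:
            # Every prediction batch was consumed without enough improvement.
            raise AssertionError(
                "%r not greater than %r" % (improvement, min_improvement))
        if len(errors) >= 3 and improvement > min_improvement:
            return True
        # Any non-True value means "not yet"; the poller keeps it for reporting.
        return "latest errors: %s" % errors
    return condition


if __name__ == "__main__":
    observed = [0.5, 0.48, 0.3, 0.1]  # made-up error rates, for illustration
    eventually(make_condition(observed, total_batches=40), timeout=1.0)
    print("improvement observed before the timeout")
```

In this sketch the timeout only bounds the worst case. Once the observed improvement passes the threshold, the poller returns immediately, regardless of how many batches remain.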

