Bump,

Just trying to find out where I can see which tests are known to be failing 
for a particular release, so I can confirm I'm building upstream correctly by 
following the build docs. I figured this would be the best place to ask since 
it pertains to building and testing upstream (I'm also more than happy to open 
a PR for any docs afterwards if needed); however, if there is a more 
appropriate place, please let me know.

Best,

Adam Chhina

> On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com> wrote:
> 
> As part of an upgrade, I was looking to run the upstream PySpark unit tests 
> on `v3.2.1-rc2` before applying some downstream patches and testing those. 
> However, I'm running into some failing unit tests, and I'm not sure whether 
> they are failing upstream as well or because of a step I missed in the build.
> 
> The failing tests so far (I believe the Python script exits on the first 
> test failure):
> ```
> ======================================================================
> FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>     eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>     lastValue = condition()
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>     self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.8960983527735014 not greater than 2
> 
> ======================================================================
> FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the final value of weights is close to the desired value.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>     eventually(condition, timeout=60.0, catch_assertions=True)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>     raise lastValue
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>     lastValue = condition()
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>     self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
> 
> ======================================================================
> FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>     eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>     raise AssertionError(
> AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
> 
> ----------------------------------------------------------------------
> Ran 13 tests in 661.536s
> 
> FAILED (failures=3, skipped=1)
> 
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
> ```
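> 
> To iterate on just this suite rather than the full run, I've been re-running 
> it directly; a sketch, assuming the `--testnames` flag from the current 
> developer-tools docs is also available in the 3.2 `run-tests` script:
> ```
> # Re-run only the failing suite instead of the whole PySpark test run
> # (assumes --testnames is supported by this release's run-tests script)
> ./python/run-tests --testnames 'pyspark.mllib.tests.test_streaming_algorithms'
> ```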
> 
> Here's how I'm currently building Spark; I was using the 
> [building-spark](https://spark.apache.org/docs/3.2.1/building-spark.html) docs 
> as a reference.
> ```
> > git clone git@github.com:apache/spark.git
> > cd spark
> > git checkout -b spark-321 v3.2.1
> > ./build/mvn -DskipTests clean package -Phive
> > export JAVA_HOME=/path/to/jdk11
> > ./python/run-tests
> ```
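> 
> Since I'm on macOS (hence the Homebrew JDK below), one way to fill in that 
> `JAVA_HOME` placeholder is the stock `java_home` helper rather than a 
> hard-coded path:
> ```
> # macOS: resolve the JDK 11 home via the system helper
> export JAVA_HOME=$(/usr/libexec/java_home -v 11)
> echo "$JAVA_HOME"  # sanity-check which JDK was picked up
> ```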
> 
> Current Java version:
> ```
> java -version
> openjdk version "11.0.17" 2022-10-18
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
> ```
> 
> Alternatively, I've also tried simply building Spark, creating a Python 3.9 
> venv, installing the requirements with `pip install -r dev/requirements.txt`, 
> and using that venv's interpreter to run the tests. However, I was running 
> into some failing pandas tests, which seemed to come from a pandas version 
> difference, since `dev/requirements.txt` didn't pin a version.
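> 
> For reference, the venv route looked roughly like this; the pandas pin at the 
> end is only a hypothetical example of what I mean by pinning (1.3.5 was a 
> contemporary 1.x release, not a version I've verified against 3.2.1):
> ```
> # Python 3.9 venv for running the PySpark tests
> python3.9 -m venv .venv
> source .venv/bin/activate
> pip install -r dev/requirements.txt
> # Hypothetical pin: requirements.txt doesn't specify a pandas version,
> # so pin one explicitly; I haven't verified 1.3.5 against the 3.2.1 tests.
> pip install 'pandas==1.3.5'
> ```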
> 
> I suppose I have a couple of questions regarding this:
> 1. Am I missing a build step needed to build Spark and run the PySpark unit 
> tests?
> 2. Where can I find out whether an upstream test is known to be failing for a 
> specific release?
> 3. Would it be possible to configure the `run-tests` script to run all tests 
> regardless of test failures? (A rough workaround sketch follows below.)
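> 
> For question 3, the closest workaround I can think of is driving `run-tests` 
> once per module, so a failure in one module doesn't stop the rest. The module 
> names below are only a subset of those defined in 
> `dev/sparktestsupport/modules.py`, so treat this as a sketch:
> ```
> # Run each PySpark module separately so one failure doesn't halt the rest;
> # module names come from dev/sparktestsupport/modules.py.
> for m in pyspark-core pyspark-sql pyspark-mllib pyspark-ml pyspark-streaming; do
>     ./python/run-tests --modules="$m" || echo "FAILED: $m"
> done
> ```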

