Oh, whoops, didn’t realize that wasn’t the release version, thanks!

> git clone --branch branch-3.2 https://github.com/apache/spark.git
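For anyone retracing this later, the full sequence I’m now running is roughly the following (the build flags and test runner are the same ones from my original mail further down; the JAVA_HOME path is just a placeholder for my local JDK 11 install):

```
# clone the 3.2 maintenance branch, rather than checking out the RC tag
git clone --branch branch-3.2 https://github.com/apache/spark.git
cd spark

# same build as before, skipping the JVM tests
./build/mvn -DskipTests clean package -Phive

# point at a JDK 11 install (placeholder path), then run the PySpark suite
export JAVA_HOME=/path/to/jdk/11
./python/run-tests
```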
Ah, so the old failing tests are passing now, but I am seeing failures in `pyspark.tests.test_broadcast` such as `test_broadcast_value_against_gc`, with the majority of them failing due to `ConnectionRefusedError: [Errno 61] Connection refused`. Maybe these tests are not meant to be run locally, and only in the pipeline? Also, I see this warning (which suggests notifying the maintainers):

```
Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
```

FWIW, not sure if this matters, but the Python executable used for running these tests is Python 3.10.9 under `/usr/local/bin/python3`.
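In case it helps reproduce this, I was planning to isolate that module roughly as below; the `--testnames` and `--python-executables` options are just my reading of `python/run-tests`, so treat the exact spellings as unverified on my end:

```
# re-run only the broadcast tests against the interpreter mentioned above
# (option names are my best reading of python/run-tests, not verified here)
./python/run-tests \
  --python-executables=/usr/local/bin/python3 \
  --testnames 'pyspark.tests.test_broadcast'
```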
Best,

Adam Chhina

> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>
> Replace
>
> > git clone g...@github.com:apache/spark.git
>
> > git checkout -b spark-321 v3.2.1
>
> with
> git clone --branch branch-3.2 https://github.com/apache/spark.git
>
> This will give you branch-3.2 as of today, which I suppose is what you call upstream:
> https://github.com/apache/spark/commits/branch-3.2
> and right now all the tests in GitHub Actions are passing :)
>
>
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sro...@gmail.com>:
>> Never seen those, but it's probably a difference in pandas/numpy versions.
>> You can see the current CI/CD test results in GitHub Actions. But you want
>> to use release versions, not an RC. 3.2.1 is not the latest version, and
>> it's possible the tests were actually failing in the RC.
>>
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschh...@gmail.com> wrote:
>>> Bump,
>>>
>>> Just trying to see where I can find which tests are known to be failing for a
>>> particular release, to ensure I’m building upstream correctly following the
>>> build docs. I figured this would be the best place to ask, since it pertains to
>>> building and testing upstream (I’m also more than happy to provide a PR for any
>>> docs if required afterwards); however, if there is a more appropriate place,
>>> please let me know.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com> wrote:
>>> >
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests on
>>> > `v3.2.1-rc2` before I applied some downstream patches and tested those.
>>> > However, I'm running into some issues with failing unit tests, which I'm
>>> > not sure are failing upstream or due to some step I missed in the build.
>>> >
>>> > The current failing tests (at least so far, since I believe the Python
>>> > script exits on test failure):
>>> > ```
>>> > ======================================================================
>>> > FAIL: test_train_prediction
>>> > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>>> >     lastValue = condition()
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> >
>>> > ======================================================================
>>> > FAIL: test_parameter_accuracy
>>> > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the final value of weights is close to the desired value.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>>> >     raise lastValue
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>>> >     lastValue = condition()
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>>> >
>>> > ======================================================================
>>> > FAIL: test_training_and_prediction
>>> > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the model improves on toy data with no. of batches
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>>> >     raise AssertionError(
>>> > AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>> >
>>> > ----------------------------------------------------------------------
>>> > Ran 13 tests in 661.536s
>>> >
>>> > FAILED (failures=3, skipped=1)
>>> >
>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>>> > ```
>>> >
>>> > Here's how I'm currently building Spark; I was using the
>>> > [building-spark](https://spark.apache.org/docs/3.2.1/building-spark.html) docs as a reference.
>>> > ```
>>> > > git clone g...@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>> > > ./build/mvn -DskipTests clean package -Phive
>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>> > > ./python/run-tests
>>> > ```
>>> >
>>> > Current Java version:
>>> > ```
>>> > java -version
>>> > openjdk version "11.0.17" 2022-10-18
>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>> > ```
>>> >
>>> > Alternatively, I've also tried simply building Spark, creating a Python 3.9 venv, installing the
>>> > requirements with `pip install -r dev/requirements.txt`, and using that interpreter to run the tests.
>>> > However, I was running into some failing pandas tests, which seemed to come from a pandas
>>> > version difference, since `requirements.txt` doesn't pin a version.
>>> >
>>> > I suppose I have a couple of questions in regard to this:
>>> > 1. Am I missing a build step to build Spark and run the PySpark unit tests?
>>> > 2. Where could I find whether an upstream test is failing for a specific release?
>>> > 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
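(Adding below the quoted thread for completeness: a rough sketch of the Python 3.9 venv setup mentioned in my original mail. The last line is just a plain-Python way to record which pandas/numpy versions actually got installed, since `dev/requirements.txt` doesn't pin them; that should make it easier to compare against whatever the CI jobs use.)

```
# separate 3.9 environment with the dev requirements (unpinned, as noted above)
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r dev/requirements.txt

# record the resolved versions for comparison with CI
python -c "import pandas, numpy; print(pandas.__version__, numpy.__version__)"
```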