Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
Release _branches_ are tested as commits arrive on the branch, yes. That's
what you see at https://github.com/apache/spark/actions
Released versions are fixed: they don't change, and they were also manually
tested before release, so no, they are not re-tested; there is no need.

You presumably have some local environment issue, because the Spark 3.2.3
source was passing CI/CD at the time of release, as well as the PMC's manual
tests.
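
One way to start narrowing down a local environment issue is to record which interpreter and JVM a test run would pick up. The following is a minimal sketch, not part of Spark: the function name `describe_test_environment` is made up for illustration, and it only inspects `JAVA_HOME` and `PATH`.

```python
import os
import shutil
import sys


def describe_test_environment(env=None):
    """Collect the interpreter and JVM settings a local test run would see.

    `env` is a mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    return {
        "python_executable": sys.executable,
        "python_version": ".".join(str(part) for part in sys.version_info[:3]),
        "java_home": env.get("JAVA_HOME"),  # None when the variable is unset
        "java_on_path": shutil.which("java", path=env.get("PATH")),
    }


if __name__ == "__main__":
    for key, value in describe_test_environment().items():
        print(f"{key}: {value}")
```

Comparing this output between the shell that built Spark and the shell that runs the tests may show whether the two steps saw different Java or Python installations.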



Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Adam Chhina
Hi Sean,

That’s fair with regard to 3.3.x being the current release branch. I’m not 
familiar with the testing schedule, but I had assumed all currently supported 
release versions would have some nightly/weekly tests run; is that not the 
case? I only ask because, when I saw these test failures, I assumed they 
would already be known (or not) from some recurring testing pipeline.

Unfortunately, using v3.2.3 produced the same test failures.

> git clone --branch v3.2.3 https://github.com/apache/spark.git

I’ve posted the traceback below for one of the tests that ran. At the end it 
says to check the logs (`see logs`). However, I wasn’t sure whether that just 
meant the traceback or some more detailed logs elsewhere. I wasn’t able to 
find any files that looked relevant when running `find . -name "*logs*"` 
afterwards. Sorry if I’m missing something obvious.

```
test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR

======================================================================
ERROR: test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in test_broadcast_with_encryption
    self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in _test_multiple_broadcasts
    conf = SparkConf()
  File "$path/spark/python/pyspark/conf.py", line 120, in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1709, in __getattr__
    answer = self._gateway_client.send_command(
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1036, in send_command
    connection = self._get_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 284, in _get_connection
    connection = self._create_new_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 291, in _create_new_connection
    connection.connect_to_java_server()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 438, in connect_to_java_server
    self.socket.connect((self.java_address, self.java_port))
ConnectionRefusedError: [Errno 61] Connection refused

----------------------------------------------------------------------
Ran 7 tests in 12.950s

FAILED (errors=7)
sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>

Had test failures in pyspark.tests.test_broadcast with /usr/local/bin/python3; see logs.
```

Best,

Adam Chhina
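
On the `see logs` question: the pattern `*logs*` only matches names that contain the substring "logs", so a file named something like `unit-tests.log` would not be found. As far as I know, `python/run-tests` writes its detailed output to `python/unit-tests.log`, but treat that path as an assumption to verify in your own checkout. A minimal sketch of a broader search follows; the helper `find_log_files` is hypothetical and not part of Spark.

```python
from pathlib import Path


def find_log_files(root):
    """Return files under `root` whose file name contains 'log' (case-insensitive)."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and "log" in path.name.lower()
    )


if __name__ == "__main__":
    # Run from the top of the Spark checkout to list candidate log files.
    for path in find_log_files("."):
        print(path)
```

Matching on "log" rather than "logs" trades precision for recall, so expect some unrelated hits (for example logging configuration files) in the output.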


Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
That isn't the released version either, but rather the head of the 3.2
branch (which is beyond 3.2.3).
You may want to check out the v3.2.3 tag instead:
https://github.com/apache/spark/tree/v3.2.3
... rather than 3.2.1.
But note that 3.3.x is the current release branch anyway.

Hard to say what the error is without seeing more of the error log.

That final warning is fine; it just means you are using Java 11+.



Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Adam Chhina
Oh, whoops, didn’t realize that wasn’t the release version, thanks!

> git clone --branch branch-3.2 https://github.com/apache/spark.git 

Ah, so the old failing tests are passing now, but I am seeing failures in 
`pyspark.tests.test_broadcast`, such as `test_broadcast_value_against_gc`, with 
the majority of them failing due to `ConnectionRefusedError: [Errno 61] 
Connection refused`. Maybe these tests are not meant to be run locally, only 
in the pipeline?

Also, I see this warning, which says to notify the maintainers:

```
Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
```

FWIW, not sure if this matters, but the Python executable used for running 
these tests is `Python 3.10.9` under `/usr/local/bin/python3`.

Best,

Adam Chhina
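
For context on the error itself: `Connection refused` raised from `connect_to_java_server` means nothing was listening on the gateway address and port when Python tried to connect, which usually suggests the JVM side never started or had already exited. The sketch below is a generic diagnostic, not Spark or Py4J code; the helper name `port_accepts_connections` is made up for illustration.

```python
import socket


def port_accepts_connections(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (errno 61 on macOS, 111 on Linux) and
        # timeouts are both subclasses of OSError.
        return False


if __name__ == "__main__":
    # Demonstrate both outcomes with a throwaway listener on an OS-chosen port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print("while listening:", port_accepts_connections("127.0.0.1", port))
    listener.close()
    print("after close:", port_accepts_connections("127.0.0.1", port))
```

The second call mirrors what the failing tests ran into: the client side is fine, but no process is accepting connections on the port.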


Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Bjørn Jørgensen
Replace
> > git clone g...@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1

with
git clone --branch branch-3.2 https://github.com/apache/spark.git
This will give you branch 3.2 as of today, which I suppose is what you call
upstream:

https://github.com/apache/spark/commits/branch-3.2
Right now all tests in GitHub Actions are passing :)



Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
I've never seen those, but it's probably a difference in pandas or numpy
versions. You can see the current CI/CD test results in GitHub Actions. But you
want to use release versions, not an RC. 3.2.1 is not the latest version, and
it's possible the tests were actually failing in the RC.
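
To compare a local setup against what CI uses, it helps to print the installed versions of the packages in question. This is a minimal sketch using only the standard library; the function name `installed_versions` and the default package list are assumptions for illustration, so extend the list as needed.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "numpy")):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that `python/run-tests` uses, since a different interpreter may have a different set of packages.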


Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Adam Chhina
Bump,

Just trying to find out which tests are known to fail for a particular 
release, to make sure I’m building upstream correctly per the build docs. I 
figured this would be the best place to ask, as it pertains to building and 
testing upstream (I’m also more than happy to provide a PR for any docs if 
required afterwards). However, if there is a more appropriate place, please 
let me know.

Best,

Adam Chhina

> On Dec 27, 2022, at 11:37 AM, Adam Chhina  wrote:
> 
> As part of an upgrade I was looking to run upstream PySpark unit tests on 
> `v3.2.1-rc2` before I applied some downstream patches and tested those. 
> However, I'm running into some issues with failing unit tests, which I'm not 
> sure are failing upstream or due to some step I missed in the build.
> 
> The current failing tests (at least so far, since I believe the python script 
> exits on test failure):
> ```
> ======================================================================
> FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>     eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>     lastValue = condition()
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>     self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.8960983527735014 not greater than 2
>
> ======================================================================
> FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the final value of weights is close to the desired value.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>     eventually(condition, timeout=60.0, catch_assertions=True)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>     raise lastValue
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>     lastValue = condition()
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>     self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>
> ======================================================================
> FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>     eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>     raise AssertionError(
> AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>
> ----------------------------------------------------------------------
> Ran 13 tests in 661.536s
>
> FAILED (failures=3, skipped=1)
>
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
> ```
> 
> Here's how I'm currently building Spark; I was using the 
> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html) docs 
> as a reference.
> ```
> > git clone g...@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1
> > ./build/mvn -DskipTests clean package -Phive
> > export JAVA_HOME=$(path/to/jdk/11)
> > ./python/run-tests
> ```
> 
> Current Java version
> ```
> java -version
> openjdk version "11.0.17" 2022-10-18
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
> ```
> 
> Alternatively, I've also tried simply building Spark and using a python=3.9 
> venv and installing the requirements from `pip install -r 
> dev/requirements.txt` and using that as the interpreter to run tests. 
> However, I was running into some failing pandas