I still believe that the way to solve this is by splitting our Python build 
requirements into two:

1. Abstract dependencies: These capture the most open/flexible set of 
dependencies for the project. They are what we publish to PyPI.
2. Concrete build dependencies: These are derived automatically from the 
abstract dependencies. All dependencies, including transitive ones, are fully 
enumerated and pinned to specific versions. We use and reference a single set 
of concrete build dependencies across GitHub Actions, Docker, and local test 
environments.

All modern Python packaging approaches follow this pattern. The abstract 
dependencies go in your pyproject.toml and the concrete dependencies go in a 
lock file.
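
To make the split concrete, here is a minimal sketch of the two files (the 
package names, versions, and file names are illustrative, not our actual 
requirements):

    $ cat requirements.in    # abstract: open version ranges
    numpy>=1.21
    pandas

    $ cat requirements.txt   # concrete: derived from the above, fully pinned
    numpy==1.26.4
    pandas==2.2.2
    python-dateutil==2.9.0   # transitive dependency of pandas, also pinned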

Adopting modern Python packaging tooling (like uv, Poetry, or Hatch) might be 
too big of a change for us right now, which is why, when I last tried to do 
this <https://github.com/apache/spark/pull/27928>, I used pip-tools 
<https://github.com/jazzband/pip-tools>, which lets us stick with plain pip 
while adopting this modern pattern.
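
The core pip-tools workflow is just two commands (a sketch; where these files 
would live in the Spark repo is an open question):

    $ pip install pip-tools
    $ pip-compile requirements.in   # resolve and write the pinned requirements.txt
    $ pip-sync requirements.txt     # make the current environment match the pins

Re-running pip-compile is how we would deliberately pick up newer library 
versions, rather than having them show up unannounced and break a build.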

I’m willing to take another stab at this, but I believe it needs buy-in from 
Hyukjin, who was opposed to the idea last we discussed it.

> My understanding is that, in the PySpark CI we do not use fixed Python 
> library versions as we want to test with the latest library versions as soon 
> as possible.

This is my understanding too, but I believe testing against unpinned 
dependencies costs us a great deal of time playing whack-a-mole with build 
problems. Every one of those problems eventually gets solved by pinning a 
dependency, but because we are not pinning them in a consistent or automated 
way, we end up with a single library being specified and pinned to different 
versions across 10+ files 
<https://lists.apache.org/thread/hrs8kw31163v7tydjwm9cx5yktpvdjnj>.

I don’t think whatever benefit we get from testing against the latest 
versions outweighs this cost in complexity and management overhead.
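
For what it's worth, the end state I have in mind is simple: GitHub Actions, 
the Dockerfiles, and local test environments would all reference and install 
from the same compiled file (the exact path here is illustrative):

    $ pip install -r dev/requirements.txt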

Nick


> On Feb 4, 2025, at 10:30 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
> 
> + @Hyukjin Kwon <mailto:gurwls...@gmail.com> 
> 
> My understanding is that, in the PySpark CI we do not use fixed Python 
> library versions as we want to test with the latest library versions as soon 
> as possible. However, the release scripts use fixed Python library versions 
> to make sure the release is stable. This means that for almost every major 
> release we 
> need to update the release scripts to sync the Python library versions with 
> the CI, as the PySpark code or doc generation code may not be compatible with 
> the old versions after 6 months.
> 
> It would be better if we could automate this process, but I don't have a 
> good idea right now.
> 
> On Tue, Feb 4, 2025 at 6:32 PM Nimrod Ofek <ofek.nim...@gmail.com 
> <mailto:ofek.nim...@gmail.com>> wrote:
>> Hi all,
>> 
>> I am trying to revive this thread - to work towards a better release 
>> process, and to make sure we have no conflicts in the artifacts we use, as 
>> nicholas.cham...@gmail.com <mailto:nicholas.cham...@gmail.com> mentioned.
>> @Wenchen Fan <mailto:cloud0...@gmail.com> - can you please clarify? You 
>> state that the release scripts use a different build and Docker image than 
>> GitHub Actions. 
>> The release scripts release the artifacts that are actually being used... 
>> What are the other ones, which are created by GitHub Actions today, used 
>> for? Only testing?
>> 
>> Personally, I believe that "release is king" - meaning what actually is 
>> being used by all the users is the "correct" build and we should align 
>> ourselves to it.
>> 
>> What do you think the next steps are for us to take in order to make the 
>> release process fully automated and simple?
>> 
>> Thanks,
>> Nimrod
>> 
>> 
>> On Mon, May 13, 2024 at 2:31 PM Wenchen Fan <cloud0...@gmail.com 
>> <mailto:cloud0...@gmail.com>> wrote:
>>> Hi Nicholas,
>>> 
>>> Thanks for your help! I'm definitely interested in participating in this 
>>> unification work. Let me know how I can help.
>>> 
>>> Wenchen
>>> 
>>> On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas 
>>> <nicholas.cham...@gmail.com <mailto:nicholas.cham...@gmail.com>> wrote:
>>>> Re: unification
>>>> 
>>>> We also have a long-standing problem with how we manage Python 
>>>> dependencies, something I’ve tried (unsuccessfully 
>>>> <https://github.com/apache/spark/pull/27928>) to fix in the past.
>>>> 
>>>> Consider, for example, how many separate places this numpy dependency is 
>>>> installed:
>>>> 
>>>> 1. 
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
>>>> 2. 
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
>>>> 3. 
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
>>>> 4. 
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
>>>> 5. 
>>>> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
>>>> 6. 
>>>> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
>>>> 7. 
>>>> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
>>>> 8. 
>>>> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
>>>> 9. 
>>>> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
>>>> 10. 
>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
>>>> 11. 
>>>> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
>>>> 12. 
>>>> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>>>> 
>>>> None of those installations reference a unified version requirement, so 
>>>> naturally they are inconsistent across all these different lines. Some say 
>>>> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In 
>>>> several cases there is no version requirement specified at all.
>>>> 
>>>> I’m interested in trying again to fix this problem, but it needs to be in 
>>>> collaboration with a committer since I cannot fully test the release 
>>>> scripts. (This testing gap is what doomed my last attempt at fixing this 
>>>> problem.)
>>>> 
>>>> Nick
>>>> 
>>>> 
>>>>> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com 
>>>>> <mailto:cloud0...@gmail.com>> wrote:
>>>>> 
>>>>> After finishing the 4.0.0-preview1 RC1, I have more experience with this 
>>>>> topic now.
>>>>> 
>>>>> In fact, the main job of the release process - building packages and 
>>>>> documents - is tested in GitHub Actions jobs. However, the way we test 
>>>>> them is different from what we do in the release scripts.
>>>>> 
>>>>> 1. the execution environment is different:
>>>>> The release scripts define the execution environment with this 
>>>>> Dockerfile: 
>>>>> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
>>>>> However, GitHub Actions jobs use a different Dockerfile: 
>>>>> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
>>>>> We should figure out a way to unify them. The Docker image for the 
>>>>> release process needs to set up more things, so it may not be viable to 
>>>>> use a single Dockerfile for both.
>>>>> 
>>>>> 2. the execution code is different. Take building the documents as an example:
>>>>> The release scripts: 
>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
>>>>> The GitHub Actions job: 
>>>>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
>>>>> I don't know which one is more correct, but we should definitely unify 
>>>>> them.
>>>>> 
>>>>> It would be better if we could run the release scripts as GitHub Actions 
>>>>> jobs, but I think it's more important to do the unification now.
>>>>> 
>>>>> Thanks,
>>>>> Wenchen
>>>>> 
>>>>> 
>>>>> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr 
>>>>> <mailto:huss...@awala.fr>> wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I can answer some of your questions that are common to other Apache projects.
>>>>>> 
>>>>>> > Who currently has permissions for GitHub Actions? Is there a specific 
>>>>>> > owner for that today, or a different volunteer each time?
>>>>>> 
>>>>>> The Apache GitHub organization owns the GitHub Actions setup, and 
>>>>>> committers (contributors with write permissions) can retrigger/cancel a 
>>>>>> GitHub Actions workflow, but the GitHub Actions runners are managed by 
>>>>>> the Apache infra team.
>>>>>> 
>>>>>> > What are the current limits of GitHub Actions, who sets them - and what 
>>>>>> > is the process to change those (if possible at all, but I presume not 
>>>>>> > all Apache projects have the same limits)?
>>>>>> 
>>>>>> For limits, I don't think there is any significant limit, especially 
>>>>>> since the Apache organization has 900 donated runners used by its 
>>>>>> projects, and there is an initiative from the Infra team to add 
>>>>>> self-hosted runners running on Kubernetes (document 
>>>>>> <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>).
>>>>>> 
>>>>>> > Where should the artifacts be stored?
>>>>>> 
>>>>>> Usually, we use Maven for jars, DockerHub for Docker images, and GitHub 
>>>>>> cache for workflow caches. But we can use GitHub artifacts to store any 
>>>>>> kind of package (even Docker images, in the GHCR), which is fully 
>>>>>> accepted by Apache policies. Also, if the project has a cloud account 
>>>>>> (AWS, GCP, Azure, ...), a bucket can be used to store some of the 
>>>>>> packages.
>>>>>> 
>>>>>> > Who should be permitted to sign a version - and what is the process 
>>>>>> > for that?
>>>>>> 
>>>>>> The Apache documentation is clear about this: by default only PMC 
>>>>>> members can be release managers, but we can contact the infra team to 
>>>>>> add one of the committers as a release manager (document 
>>>>>> <https://infra.apache.org/release-publishing.html#releasemanager>). The 
>>>>>> process of creating a new version is described in this document 
>>>>>> <https://www.apache.org/legal/release-policy.html#policy>.
>>>>>> 
>>>>>> 
>>>>>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com 
>>>>>> <mailto:ofek.nim...@gmail.com>> wrote:
>>>>>>> Following the conversation started around the Spark 4.0.0 release, this 
>>>>>>> is a thread to discuss improvements to our release processes.
>>>>>>> 
>>>>>>> I'll start by raising some questions that probably should have answers 
>>>>>>> to start the discussion:
>>>>>>> 
>>>>>>> What is currently running in GitHub Actions?
>>>>>>> Who currently has permissions for GitHub Actions? Is there a specific 
>>>>>>> owner for that today, or a different volunteer each time?
>>>>>>> What are the current limits of GitHub Actions, who sets them - and what 
>>>>>>> is the process to change those (if possible at all, but I presume not 
>>>>>>> all Apache projects have the same limits)?
>>>>>>> What versions should we support as an output for the build?
>>>>>>> Where should the artifacts be stored?
>>>>>>> What should be the output? Only a tar, or also a Docker image published 
>>>>>>> somewhere?
>>>>>>> Do we want to have a release on fixed dates or a manual release upon 
>>>>>>> request?
>>>>>>> Who should be permitted to sign a version - and what is the process for 
>>>>>>> that?
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> Nimrod
>>>> 
