Your first several points align with what I explained for Python regarding 
abstract vs. concrete dependencies.

As I noted, the blocker for progress on reorganizing and cleaning up our Python 
dependencies in this way is committer alignment. 


> On Feb 6, 2025, at 9:30 AM, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
> 
> Hi,
> 
> I'll start with a disclaimer: I am mostly a Java / Scala developer, so I am 
> not that well versed in Python best practices.
> Having said that, here are some thoughts I have on the subject; I hope they 
> make sense :)
> I think we need to differentiate between code and dependencies for testing 
> purposes, code and dependencies for internal use (tools, build, etc.), and 
> the actual code that is shipped to users - like PySpark itself. The 
> dependencies of these categories should be kept separate, since test and 
> tooling dependencies should not constrain end users in which packages they 
> can use and at which versions.
> As a follow-up to the previous note, the actual Python code that runs on the 
> driver (or the Connect server, depending on the deployment) has a big impact 
> on users who work in Python - and since shading is not a practice in Python 
> (unlike in JVM languages), we should strive to use as few dependencies as 
> possible, so that we do not impose restrictions on our users.
> We should evaluate a way to avoid conflicts between test and production 
> dependencies.
> For instance, test dependencies can be "calculated" from the list of regular 
> dependencies plus test-only dependencies - to make sure we are testing what 
> we are actually shipping...
> That means we should probably have a script that deletes requirements.txt 
> and regenerates it from source files (for instance generalRequirements.txt 
> and testRequirements.txt, or something like that), as in the sketch below.
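> 
> (A minimal sketch of that idea, assuming pip-compile from pip-tools is 
> available; the file names here are only examples, not existing Spark files:
> 
>     # generalRequirements.in - abstract runtime deps, e.g. "numpy>=1.21"
>     # testRequirements.in    - starts with "-r generalRequirements.in",
>     #                          then adds the test-only deps
>     pip-compile generalRequirements.in -o requirements.txt
>     pip-compile testRequirements.in -o test-requirements.txt
> 
> so the test pins are always derived from the same runtime list.)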
> Python dependencies should not be installed directly on the local machine - 
> one approach I know works with PyCharm is using a Docker-based remote 
> interpreter, to make sure no locally installed packages influence which runs 
> succeed and which fail... (e.g. 
> https://www.jetbrains.com/help/pycharm/using-docker-as-a-remote-interpreter.html)
> The Docker build should live in one central location for all purposes, with 
> the other Dockerfiles layered on top of that base Dockerfile.
> This means the test images should be based on the regular Docker image, and 
> Python requirements should not be written directly in many different 
> Dockerfiles but in a single one - or at least in a requirements.txt file 
> used by all of them, etc. (see the layering sketch below).
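> 
> (Roughly, and only as an illustration - the base image name here is made up:
> 
>     # base Dockerfile: the only place that installs the Python deps
>     #   COPY requirements.txt .
>     #   RUN pip install -r requirements.txt
>     #
>     # test / release Dockerfiles: layered on top, no duplicated pip installs
>     #   FROM apache/spark-build-base:latest
>     #   RUN ...test- or release-specific setup only...
> 
> so a version bump in the shared requirements.txt reaches every image.)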
> The same scripts should be used locally and within the GitHub Actions build 
> pipelines, because only that will ensure that what we test and run locally 
> and what we publish (and let others build) are exactly the same and stay 
> consistent.
> Consider using Multi-Release JAR Files ( https://openjdk.org/jeps/238 ) - a 
> feature added in Java 9 - to "Extend the JAR file format to allow 
> multiple, Java-release-specific versions of class files to coexist in a 
> single archive".
> In short, it lets you ship multiple implementations of the same class for 
> several different Java versions.
> This feature is meant to help libraries adopt new language features more 
> easily while keeping backward compatibility - so if there is a new API or 
> implementation that could improve performance, but that we avoid using 
> because we still support older Java versions, MRJARs mitigate that by 
> letting us ship the new implementation to users on newer JDKs while keeping 
> the existing one for older Java versions.
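> 
> (For reference, a multi-release JAR is laid out roughly like this - the 
> class name is just an example:
> 
>     org/apache/spark/util/SomeUtil.class
>         (compiled for the baseline JDK)
>     META-INF/versions/21/org/apache/spark/util/SomeUtil.class
>         (variant compiled for JDK 21+)
> 
> plus "Multi-Release: true" in META-INF/MANIFEST.MF; at runtime the JVM picks 
> the newest variant applicable to its own version.)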
> Of course, another option is to simply support only the latest LTS JDK 
> version in each release. I know some are wary of this option, but since 
> Spark applications are usually self-contained and not used as a library 
> within some other Java project, I think it is also viable, and it would let 
> us adopt newer features as they become available and fit - for instance 
> Virtual Threads (which can let us run more threads per machine, and can 
> provide better parallelism for I/O-intensive and network operations), the 
> Vector API <https://openjdk.org/jeps/489> - which can boost performance in a 
> way similar to what Databricks' Photon and the Velox library do, but 
> directly in Java rather than in C++ - Ahead-of-Time Class Loading & Linking 
> <https://openjdk.org/jeps/483> for faster startup times, Value Objects 
> <https://openjdk.org/jeps/8277163>, FFM <https://openjdk.org/jeps/454> 
> instead of JNI, and many more.
> Is there a document that shows what the current release managers do to 
> actually build and release a version? Step by step?
> 
> Thanks,
> Nimrod
> 
> 
> On Tue, Feb 4, 2025 at 6:31 PM Nicholas Chammas <nicholas.cham...@gmail.com 
> <mailto:nicholas.cham...@gmail.com>> wrote:
>> I still believe that the way to solve this is by splitting our Python build 
>> requirements into two:
>> 
>> 1. Abstract dependencies: These capture the most open/flexible set of 
>> dependencies for the project. They are posted to PyPI.
>> 2. Concrete build dependencies: These are derived automatically from the 
>> abstract dependencies. The dependencies and transitive dependencies are 
>> fully enumerated and pinned to specific versions. We use and reference a 
>> single set of concrete build dependencies across GitHub Actions, Docker, and 
>> local test environments.
>> 
>> All modern Python packaging approaches follow this pattern. The abstract 
>> dependencies go in your pyproject.toml and the concrete dependencies go in a 
>> lock file.
>> 
>> Adopting modern Python packaging tooling (like uv, Poetry, or Hatch) might 
>> be too big of a change for us right now, which is why when I last tried to 
>> do this <https://github.com/apache/spark/pull/27928> I used pip-tools 
>> <https://github.com/jazzband/pip-tools>, which lets us stick to plain pip 
>> but adopt this modern pattern.
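>> 
>> (For context, a minimal sketch of that pip-tools flow - the file name 
>> requirements.in is only an example:
>> 
>>     # requirements.in holds the abstract deps, e.g. "numpy>=1.21"
>>     pip-compile requirements.in -o requirements.txt   # pins all deps, incl. transitive
>>     pip install -r requirements.txt                    # same file in CI, Docker, locally
>> 
>> The pins only change when pip-compile is re-run, so every environment 
>> installs the same versions.)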
>> 
>> I’m willing to take another stab at this, but I believe it needs buy-in from 
>> Hyukjin, who was opposed to the idea last we discussed it.
>> 
>> > My understanding is that, in the PySpark CI we do not use fixed Python 
>> > library versions as we want to test with the latest library versions as 
>> > soon as possible.
>> 
>> This is my understanding too, but I believe testing against unpinned 
>> dependencies costs us a lot of wasted time as we play whack-a-mole with 
>> build problems. Every problem eventually gets solved by pinning a 
>> dependency, but because we are not pinning in a consistent or automated 
>> way, we end up with a single library specified and pinned to different 
>> versions across 10+ files 
>> <https://lists.apache.org/thread/hrs8kw31163v7tydjwm9cx5yktpvdjnj>.
>> 
>> I don’t think whatever benefit we are getting from this approach outweighs 
>> this cost in complexity and management overhead.
>> 
>> Nick
>> 
>> 
>>> On Feb 4, 2025, at 10:30 AM, Wenchen Fan <cloud0...@gmail.com 
>>> <mailto:cloud0...@gmail.com>> wrote:
>>> 
>>> + @Hyukjin Kwon <mailto:gurwls...@gmail.com> 
>>> 
>>> My understanding is that in the PySpark CI we do not use fixed Python 
>>> library versions, as we want to test with the latest library versions as 
>>> soon as possible. However, the release scripts use fixed Python library 
>>> versions to make sure the release is stable. This means that for almost 
>>> every major release we need to update the release scripts to sync the 
>>> Python library versions with the CI, as the PySpark code or doc generation 
>>> code may not be compatible with the old versions after 6 months.
>>> 
>>> It would be better if we automated this process. I don't have a fully 
>>> worked-out idea right now, but perhaps something along the lines of the 
>>> sketch below could work.
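>>> 
>>> (Purely as an illustration - the file names, schedule, and job name below 
>>> are made up - a scheduled GitHub Actions job could re-run pip-compile and 
>>> fail when the pinned versions have drifted:
>>> 
>>>     name: refresh-python-pins
>>>     on:
>>>       schedule:
>>>         - cron: "0 0 * * 1"   # weekly
>>>     jobs:
>>>       refresh:
>>>         runs-on: ubuntu-latest
>>>         steps:
>>>           - uses: actions/checkout@v4
>>>           - run: pip install pip-tools
>>>           - run: pip-compile --upgrade dev/requirements.in -o dev/requirements.txt
>>>           - run: git diff --exit-code dev/requirements.txt  # fails if the pins changed
>>> 
>>> The diff could also be turned into an automatic pull request instead of a 
>>> failing check.)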
>>> 
>>> On Tue, Feb 4, 2025 at 6:32 PM Nimrod Ofek <ofek.nim...@gmail.com 
>>> <mailto:ofek.nim...@gmail.com>> wrote:
>>>> Hi all,
>>>> 
>>>> I am trying to revive this thread, to work towards a better release 
>>>> process and to make sure we have no conflicts in the artifacts we use, as 
>>>> nicholas.cham...@gmail.com <mailto:nicholas.cham...@gmail.com> mentioned.
>>>> @Wenchen Fan <mailto:cloud0...@gmail.com> - can you please clarify? You 
>>>> state that the release scripts use a different build and Docker image 
>>>> than GitHub Actions.
>>>> The release scripts produce the artifacts that are actually being 
>>>> used... What are the other ones, created by GitHub Actions today, used 
>>>> for? Only testing?
>>>> 
>>>> Personally, I believe that "release is king" - meaning that what is 
>>>> actually used by all the users is the "correct" build, and we should 
>>>> align ourselves to it.
>>>> 
>>>> What do you think are the needed next steps for us to take in order to 
>>>> make the release process fully automated and simple?
>>>> 
>>>> Thanks,
>>>> Nimrod
>>>> 
>>>> 
>>>> On Mon, May 13, 2024 at 2:31 PM Wenchen Fan <cloud0...@gmail.com 
>>>> <mailto:cloud0...@gmail.com>> wrote:
>>>>> Hi Nicholas,
>>>>> 
>>>>> Thanks for your help! I'm definitely interested in participating in this 
>>>>> unification work. Let me know how I can help.
>>>>> 
>>>>> Wenchen
>>>>> 
>>>>> On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas 
>>>>> <nicholas.cham...@gmail.com <mailto:nicholas.cham...@gmail.com>> wrote:
>>>>>> Re: unification
>>>>>> 
>>>>>> We also have a long-standing problem with how we manage Python 
>>>>>> dependencies, something I’ve tried (unsuccessfully 
>>>>>> <https://github.com/apache/spark/pull/27928>) to fix in the past.
>>>>>> 
>>>>>> Consider, for example, how many separate places this numpy dependency is 
>>>>>> installed:
>>>>>> 
>>>>>> 1. 
>>>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
>>>>>> 2. 
>>>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
>>>>>> 3. 
>>>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
>>>>>> 4. 
>>>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
>>>>>> 5. 
>>>>>> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
>>>>>> 6. 
>>>>>> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
>>>>>> 7. 
>>>>>> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
>>>>>> 8. 
>>>>>> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
>>>>>> 9. 
>>>>>> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
>>>>>> 10. 
>>>>>> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
>>>>>> 11. 
>>>>>> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
>>>>>> 12. 
>>>>>> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>>>>>> 
>>>>>> None of those installations reference a unified version requirement, so 
>>>>>> naturally they are inconsistent across all these different lines. Some 
>>>>>> say `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In 
>>>>>> several cases there is no version requirement specified at all.
>>>>>> 
>>>>>> I’m interested in trying again to fix this problem, but it needs to be 
>>>>>> in collaboration with a committer since I cannot fully test the release 
>>>>>> scripts. (This testing gap is what doomed my last attempt at fixing this 
>>>>>> problem.)
>>>>>> 
>>>>>> Nick
>>>>>> 
>>>>>> 
>>>>>>> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com 
>>>>>>> <mailto:cloud0...@gmail.com>> wrote:
>>>>>>> 
>>>>>>> After finishing the 4.0.0-preview1 RC1, I have more experience with 
>>>>>>> this topic now.
>>>>>>> 
>>>>>>> In fact, the main jobs of the release process - building packages and 
>>>>>>> documentation - are tested in GitHub Actions jobs. However, the way we 
>>>>>>> test them is different from what we do in the release scripts.
>>>>>>> 
>>>>>>> 1. the execution environment is different:
>>>>>>> The release scripts define the execution environment with this 
>>>>>>> Dockerfile: 
>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
>>>>>>> However, Github Action jobs use a different Dockerfile: 
>>>>>>> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
>>>>>>> We should figure out a way to unify it. The docker image for the 
>>>>>>> release process needs to set up more things so it may not be viable to 
>>>>>>> use a single Dockerfile for both.
>>>>>>> 
>>>>>>> 2. the execution code is different. Take building the documentation as 
>>>>>>> an example:
>>>>>>> The release scripts: 
>>>>>>> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
>>>>>>> The Github Action job: 
>>>>>>> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
>>>>>>> I don't know which one is more correct, but we should definitely unify 
>>>>>>> them.
>>>>>>> 
>>>>>>> It would be better if we could run the release scripts as GitHub 
>>>>>>> Actions jobs, but I think it's more important to do the unification 
>>>>>>> now.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Wenchen
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr 
>>>>>>> <mailto:huss...@awala.fr>> wrote:
>>>>>>>> Hello,
>>>>>>>> 
>>>>>>>> I can answer some of your common questions with other Apache projects.
>>>>>>>> 
>>>>>>>> > Who currently has permissions for Github actions? Is there a 
>>>>>>>> > specific owner for that today or a different volunteer each time?
>>>>>>>> 
>>>>>>>> The Apache organization owns Github Actions, and committers 
>>>>>>>> (contributors with write permissions) can retrigger/cancel a Github 
>>>>>>>> Actions workflow, but Github Actions runners are managed by the Apache 
>>>>>>>> infra team.
>>>>>>>> 
>>>>>>>> > What are the current limits of GitHub Actions, who set them - and 
>>>>>>>> > what is the process to change those (if possible at all, but I 
>>>>>>>> > presume not all Apache projects have the same limits)?
>>>>>>>> 
>>>>>>>> As for limits, I don't think there are any significant ones, 
>>>>>>>> especially since the Apache organization has 900 donated runners used 
>>>>>>>> by its projects, and there is an initiative from the Infra team to add 
>>>>>>>> self-hosted runners running on Kubernetes (document 
>>>>>>>> <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>).
>>>>>>>> 
>>>>>>>> > Where should the artifacts be stored?
>>>>>>>> 
>>>>>>>> Usually, we use Maven for jars, DockerHub for Docker images, and 
>>>>>>>> Github cache for workflow cache. But we can use Github artifacts to 
>>>>>>>> store any kind of package (even Docker images in the ghcr), which is 
>>>>>>>> fully accepted by Apache policies. Also if the project has a cloud 
>>>>>>>> account (AWS, GCP, Azure, ...), a bucket can be used to store some of 
>>>>>>>> the packages.
>>>>>>>> 
>>>>>>>> > Who should be permitted to sign a version - and what is the process 
>>>>>>>> > for that?
>>>>>>>> 
>>>>>>>> The Apache documentation is clear about this: by default only PMC 
>>>>>>>> members can be release managers, but we can contact the infra team to 
>>>>>>>> add one of the committers as a release manager (document 
>>>>>>>> <https://infra.apache.org/release-publishing.html#releasemanager>). 
>>>>>>>> The process of creating a new version is described in this document 
>>>>>>>> <https://www.apache.org/legal/release-policy.html#policy>.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com 
>>>>>>>> <mailto:ofek.nim...@gmail.com>> wrote:
>>>>>>>>> Following the conversation started around the Spark 4.0.0 release, 
>>>>>>>>> this is a thread to discuss improvements to our release processes.
>>>>>>>>> 
>>>>>>>>> I'll start by raising some questions that probably should have 
>>>>>>>>> answers, to start the discussion:
>>>>>>>>> 
>>>>>>>>> What is currently running in GitHub Actions?
>>>>>>>>> Who currently has permissions for Github actions? Is there a specific 
>>>>>>>>> owner for that today or a different volunteer each time?
>>>>>>>>> What are the current limits of GitHub Actions, who set them - and 
>>>>>>>>> what is the process to change those (if possible at all, but I 
>>>>>>>>> presume not all Apache projects have the same limits)?
>>>>>>>>> What versions should we support as an output for the build?
>>>>>>>>> Where should the artifacts be stored?
>>>>>>>>> What should be the output? only tar or also a docker image published 
>>>>>>>>> somewhere?
>>>>>>>>> Do we want to have a release on fixed dates or a manual release upon 
>>>>>>>>> request?
>>>>>>>>> Who should be permitted to sign a version - and what is the process 
>>>>>>>>> for that?
>>>>>>>>> 
>>>>>>>>> Thanks!
>>>>>>>>> Nimrod
>>>>>> 
>> 
