Re: unification

We also have a long-standing problem with how we manage Python dependencies, something I’ve tried (unsuccessfully <https://github.com/apache/spark/pull/27928>) to fix in the past.
Consider, for example, how many separate places this numpy dependency is installed:

1. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of those installations reference a unified version requirement, so naturally they are inconsistent across all these different lines. Some say `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several cases there is no version requirement specified at all.

I’m interested in trying again to fix this problem, but it needs to be in collaboration with a committer since I cannot fully test the release scripts. (This testing gap is what doomed my last attempt at fixing this problem.)
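
To make the inconsistency above concrete, here is a rough sketch of how the differing numpy specifiers could be surfaced automatically. It is only an illustration, not something that exists in the Spark repo: the `collect_specs` helper and the regex are hypothetical, and the file contents in `sample` are made up to resemble the kinds of lines linked above (the paths are real, the contents are not).

```python
import re
from collections import defaultdict

# Matches the bare word "numpy", optionally followed by a version
# specifier such as ">=1.21" or "==1.20.3".
NUMPY_SPEC = re.compile(
    r"(?<![\w-])numpy(?![\w-])\s*((?:==|>=|<=|~=|!=|>|<)\s*[\w.*]+)?"
)


def collect_specs(files):
    """Map each numpy specifier (or '<unpinned>') to the places it appears.

    `files` is a dict of {path: text}; a real script would read these
    from disk instead.
    """
    found = defaultdict(list)
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in NUMPY_SPEC.finditer(line):
                spec = (match.group(1) or "<unpinned>").replace(" ", "")
                found[spec].append(f"{path}:{lineno}")
    return dict(found)


# Hypothetical file contents, for illustration only.
sample = {
    "dev/requirements.txt": "py4j\nnumpy>=1.21\npandas\n",
    "dev/infra/Dockerfile": "RUN pip install 'numpy==1.20.3' scipy\n",
    "dev/run-pip-tests": "pip install numpy\n",
}

specs = collect_specs(sample)
for spec, places in sorted(specs.items()):
    print(spec, places)
```

More than one key in `specs` means the install sites disagree. A check like this could fail CI in that case, though the longer-term fix is presumably to have every install site read from a single requirements or constraints file.
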
Nick

> On May 13, 2024, at 12:18 AM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this topic now.
>
> In fact, the main job of the release process: building packages and documents, is tested in Github Action jobs. However, the way we test them is different from what we do in the release scripts.
>
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release process needs to set up more things so it may not be viable to use a single Dockerfile for both.
>
> 2. the execution code is different. Use building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
>
> It's better if we can run the release scripts as Github Action jobs, but I think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala <huss...@awala.fr <mailto:huss...@awala.fr>> wrote:
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for Github actions? Is there a specific owner for that today or a different volunteer each time?
>>
>> The Apache organization owns Github Actions, and committers (contributors with write permissions) can retrigger/cancel a Github Actions workflow, but Github Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what is the process to change those (if possible at all, but I presume not all Apache projects have the same limits)?
>>
>> For limits, I don't think there is any significant limit, especially since the Apache organization has 900 donated runners used by its projects, and there is an initiative from the Infra team to add self-hosted runners running on Kubernetes (document <https://cwiki.apache.org/confluence/display/INFRA/ASF+Infra+provided+self-hosted+runners>).
>>
>> > Where should the artifacts be stored?
>>
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github cache for workflow cache. But we can use Github artifacts to store any kind of package (even Docker images in the ghcr), which is fully accepted by Apache policies. Also if the project has a cloud account (AWS, GCP, Azure, ...), a bucket can be used to store some of the packages.
>>
>>
>> > Who should be permitted to sign a version - and what is the process for that?
>>
>> The Apache documentation is clear about this, by default only PMC members can be release managers, but we can contact the infra team to add one of the committers as a release manager (document <https://infra.apache.org/release-publishing.html#releasemanager>). The process of creating a new version is described in this document <https://www.apache.org/legal/release-policy.html#policy>.
>>
>>
>> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek <ofek.nim...@gmail.com <mailto:ofek.nim...@gmail.com>> wrote:
>>> Following the conversation started with Spark 4.0.0 release, this is a thread to discuss improvements to our release processes.
>>>
>>> I'll Start by raising some questions that probably should have answers to start the discussion:
>>>
>>> What is currently running in GitHub Actions?
>>> Who currently has permissions for Github actions? Is there a specific owner for that today or a different volunteer each time?
>>> What are the current limits of GitHub Actions, who set them - and what is the process to change those (if possible at all, but I presume not all Apache projects have the same limits)?
>>> What versions should we support as an output for the build?
>>> Where should the artifacts be stored?
>>> What should be the output? only tar or also a docker image published somewhere?
>>> Do we want to have a release on fixed dates or a manual release upon request?
>>> Who should be permitted to sign a version - and what is the process for that?
>>>
>>> Thanks!
>>> Nimrod