Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this
unification work. Let me know how I can help.

Wenchen


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification

We also have a long-standing problem with how we manage Python dependencies,
something I’ve tried (unsuccessfully) to fix in the past.

Consider, for example, how many separate places this numpy dependency is 
installed:

1. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of those installations reference a unified version requirement, so 
naturally they are inconsistent across all these different lines. Some say 
`>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several 
cases there is no version requirement specified at all.
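To make the drift concrete, a small checker along these lines could flag it automatically (a sketch only; the regex and helper names are invented for illustration, not part of any existing Spark tooling):

```python
import re

# Sketch of a drift checker; the helper names and regex are illustrative,
# not part of any existing Spark tooling.
PIN_RE = re.compile(r"numpy\s*(==|>=|<=|~=)\s*([\w.]+)")

def find_numpy_pins(text: str) -> set[str]:
    """Return the distinct numpy version specifiers found in some file text."""
    return {op + ver for op, ver in PIN_RE.findall(text)}

def pins_consistent(*file_contents: str) -> bool:
    """True if every file pins numpy the same way (or not at all)."""
    all_pins = set().union(*(find_numpy_pins(t) for t in file_contents))
    return len(all_pins) <= 1

# The mix quoted above fails the check:
print(pins_consistent("numpy>=1.21", "numpy>=1.20.0", "numpy==1.20.3"))  # False
```

A check like this could run in CI over the twelve files listed above, so any new install site that disagrees with the unified requirement would fail the build.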

I’m interested in trying again to fix this problem, but it needs to be in 
collaboration with a committer since I cannot fully test the release scripts. 
(This testing gap is what doomed my last attempt at fixing this problem.)

Nick


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this
topic now.

In fact, the main work of the release process, building packages and
documentation, is already tested in GitHub Actions jobs. However, the way we
test it there is different from what the release scripts do.

1. The execution environment is different.
The release scripts define the execution environment with this Dockerfile:
https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
However, the GitHub Actions jobs use a different Dockerfile:
https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
We should figure out a way to unify them. The Docker image for the release
process needs to set up more things, so it may not be viable to use a single
Dockerfile for both.
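Even while the two Dockerfiles stay separate, a low-tech guard against them drifting apart could be a script that diffs the packages each one pip-installs (a sketch; the extraction is deliberately simplistic and the sample lines below are invented, not taken from the real files):

```python
import re

# Simplistic sketch: pull the package arguments out of `pip install` lines in
# a Dockerfile and diff the two images. Real Dockerfiles with line
# continuations or requirements files would need more careful parsing.
def pip_packages(dockerfile_text: str) -> set[str]:
    pkgs: set[str] = set()
    for match in re.finditer(r"pip3? install (.+)", dockerfile_text):
        pkgs.update(match.group(1).split())
    return pkgs

release_image = "RUN pip install numpy==1.20.3 pandas"
ci_image = "RUN pip install numpy>=1.21 pandas pyarrow"
drift = pip_packages(ci_image) ^ pip_packages(release_image)
print(sorted(drift))  # entries where the two images disagree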

2. The execution code is different. Take building the documentation as an example.
The release scripts:
https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
The GitHub Actions job:
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
I don't know which one is more correct, but we should definitely unify them.

It would be better if we could run the release scripts as GitHub Actions jobs,
but I think the more important step right now is the unification.

Thanks,
Wenchen




Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color: we should at
least add a test job to make sure the release script can produce the packages
correctly. Today it is effectively tested manually by the release manager each
time, which slows down the release process. It would be better to automate it
entirely, so that making a release is a single click by authorized people.
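As a sketch of what such a test job could assert (the `--tgz` flag of `dev/make-distribution.sh` and the `*.tgz` artifact pattern are assumptions about the release layout, not verified details):

```python
import pathlib
import subprocess

# Sketch of an automated packaging check. The script path, the --tgz flag,
# and the *.tgz glob are assumptions for illustration.
def packages_built(dist_dir: str) -> bool:
    """True if the directory contains at least one release tarball."""
    return any(pathlib.Path(dist_dir).glob("*.tgz"))

def build_and_check(repo_root: str = ".") -> bool:
    """Run the packaging script, then verify a tarball was produced."""
    subprocess.run(
        ["dev/make-distribution.sh", "--tgz"], cwd=repo_root, check=True
    )
    return packages_built(repo_root)
```

A CI job wrapping something like `build_and_check()` would catch a broken release script long before a release manager hits it.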



Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Hussein Awala
Hello,

I can answer some of your questions that are common to other Apache projects.

> Who currently has permissions for Github actions? Is there a specific
owner for that today or a different volunteer each time?

GitHub Actions is managed under the Apache GitHub organization; committers
(contributors with write permissions) can retrigger or cancel a GitHub Actions
workflow run, but the runners themselves are managed by the Apache Infra team.

> What are the current limits of GitHub Actions, who set them - and what is
the process to change those (if possible at all, but I presume not all
Apache projects have the same limits)?

As for limits, I don't think there is any significant one, especially since
the Apache organization has 900 donated runners used by its projects, and
there is an initiative from the Infra team to add self-hosted runners running
on Kubernetes (document).

> Where should the artifacts be stored?

Usually, we use Maven for jars, DockerHub for Docker images, and the GitHub
cache for workflow caching. But we can use GitHub artifacts to store any kind
of package (even Docker images, via GHCR), which is fully accepted by Apache
policies. Also, if the project has a cloud account (AWS, GCP, Azure, ...), a
bucket can be used to store some of the packages.


> Who should be permitted to sign a version - and what is the process for
that?

The Apache documentation is clear about this: by default only PMC members can
be release managers, but we can contact the Infra team to add one of the
committers as a release manager (document). The process of creating a new
version is described in this document.




[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with the Spark 4.0.0 release, this is a
thread to discuss improvements to our release processes.

I'll start by raising some questions that should probably have answers, to
kick off the discussion:


   1. What is currently running in GitHub Actions?
   2. Who currently has permissions for GitHub Actions? Is there a specific
   owner for that today, or a different volunteer each time?
   3. What are the current limits of GitHub Actions, who sets them, and what
   is the process to change them (if possible at all, but I presume not all
   Apache projects have the same limits)?
   4. What versions should we support as outputs of the build?
   5. Where should the artifacts be stored?
   6. What should the output be? Only a tarball, or also a Docker image
   published somewhere?
   7. Do we want to have a release on fixed dates, or a manual release upon
   request?
   8. Who should be permitted to sign a version, and what is the process
   for that?


Thanks!
Nimrod