Re: CI feedback time

2021-04-15 Thread Jorge Cardoso Leitão
Hi,

I agree.

> I'll submit two requirements though:
> - the configuration for CI builds must be kept in the Arrow repository
>   (as they are currently in .github, etc.)
> - CI builds must be runnable from PRs
>

I'll submit three more:
- The result of the build (pass / did not pass) must be shown on GitHub
PRs
- The logs must be public and "clickable" from GitHub
- We must not allow privileged arbitrary code execution from arbitrary users
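The first two requirements map onto GitHub's commit status API (`POST /repos/{owner}/{repo}/statuses/{sha}`). With this API, a CI system reports a `state` for a commit along with a `target_url`, and GitHub renders that URL as a link on the PR. Below is a minimal Python sketch of building that request. It assumes the self-hosted CI reports statuses itself. The helper name and the context label are hypothetical, and the HTTP call and authentication are left out:

```python
import json
from typing import Tuple

# The four states GitHub's commit status API accepts.
VALID_STATES = {"error", "failure", "pending", "success"}


def status_request(owner: str, repo: str, sha: str, state: str,
                   log_url: str, context: str = "ci/self-hosted") -> Tuple[str, str]:
    """Return the API path and JSON body that report a build result for a commit.

    Hypothetical helper: the default context label is illustrative only.
    """
    if state not in VALID_STATES:
        raise ValueError(f"unsupported state: {state!r}")
    path = f"/repos/{owner}/{repo}/statuses/{sha}"
    body = {
        "state": state,          # rendered as pass/fail on the PR
        "target_url": log_url,   # makes the public build log clickable from GitHub
        "description": f"Build {state}",
        "context": context,      # separates this check from other CI checks
    }
    return path, json.dumps(body)
```

POSTing the body to the returned path requires an authenticated token with permission to write commit statuses.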

In January I ran a proof of concept (POC) of Buildkite for the Rust builds;
see ARROW-11140 and the corresponding PR
https://github.com/apache/arrow/pull/9111. It fulfilled the above
requirements for Docker-based runs.

The runner used rootless Docker for all PRs and branches, and it allowed
people to register runners on their own repos if they wished.

Limitations:
1. No macOS or Windows support (there is no easy way to secure the runner
against arbitrary code execution)
2. Jobs cannot use sudo or other privileged operations (we would need a
separate queue for these, or e.g. a user whitelist like the one Krisztián mentioned)
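The following hypothetical sketch makes the two limitations concrete. It shows how an untrusted job could be launched with no route to privilege escalation, and how a routing rule could send privileged jobs to a separate queue. It complements the rootless Docker setup from ARROW-11140 rather than reproducing it. The helper names, the UID and the queue names are assumptions. The flags themselves (`--rm`, `--user`, `--cap-drop`, `--security-opt no-new-privileges`) are standard `docker run` options:

```python
import shlex
from typing import List


def untrusted_job_command(image: str, script: str, uid: int = 1000) -> List[str]:
    """Build a `docker run` argv for a job coming from an untrusted PR."""
    return [
        "docker", "run",
        "--rm",                                 # remove the container afterwards
        "--user", f"{uid}:{uid}",               # never run the job as root
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # setuid binaries such as sudo cannot escalate
        image,
        "bash", "-c", script,
    ]


def route_job(needs_privileges: bool, author_trusted: bool) -> str:
    """Choose a (hypothetical) queue; privileged jobs only for trusted authors."""
    if not needs_privileges:
        return "default"
    if author_trusted:
        return "privileged"
    raise PermissionError("privileged job requested by an untrusted author")


if __name__ == "__main__":
    print(shlex.join(untrusted_job_command("ubuntu:20.04", "make test")))
```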

Best,
Jorge


On Thu, Apr 15, 2021 at 12:28 AM Antoine Pitrou wrote:

>
> Hi Krisztian,
>
> Thanks for bringing this up.  This is definitely becoming a
> high-priority topic for Arrow development.
>
> I don't believe there is much opportunity for reducing the number of
> builds or their runtime.  We simply have a lot of development going on,
> and the number of different CI jobs we have is simply because we need to
> support many different configurations (and past experience has shown
> that they quickly stop working if we don't monitor them on a regular
> basis).
>
> So I think the only path forward is to build up (== buy, probably) our
> own execution resources for CI.  Whether that entails using Github
> self-hosted runners, Buildkite, or yet another system, I have no idea.
>
> I'll submit two requirements though:
> - the configuration for CI builds must be kept in the Arrow repository
>(as they are currently in .github, etc.)
> - CI builds must be runnable from PRs
>
> Regards
>
> Antoine.
>


Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Fri, Apr 16, 2021 at 1:11 AM Jed Brown wrote:
>
> Wes McKinney writes:
>
> > I think we should take a more serious look at Buildkite for some of our CI.
> >
> > * First of all, it's very easy to connect self-hosted workers and
> > supports ephemeral cloud workers in a way that would be difficult or
> > impossible with GHA. No need to have Infra fiddle with the admin
> > dashboard. So we could spin up extra workers during peak hours, or use
> > autoscaling to respond to demand.
> >
> > * We can set up more complex / dependent job pipelines rather than the
> > current GHA monolithic "long list of independent jobs" setup. For
> > example, we could have a fast gatekeeper job for C++ builds (which
> > lints and makes sure that everything compiles) that must pass before
> > more exhaustive longer-running jobs run.
>
> I don't have experience with Buildkite, but note that gitlab-runner is also 
> lightweight and well-featured as above. Here's an example with gatekeeping 
> stages across about 60 environments (mostly on-prem at multiple sites), 
> including explicit "pause-for-approval" to avoid unnecessary time-consuming 
> jobs.
>
> https://gitlab.com/petsc/petsc/-/pipelines/286655535
>
> We also use it for on-prem GPU-equipped CI with repositories hosted on 
> GitHub, reporting status to PRs. The Kubernetes and docker-machine executors 
> are intended for autoscaling.
>
> https://docs.gitlab.com/runner/executors/README.html

The CI technology/service we choose is just one piece of the puzzle.
We also need to figure out a sustainable way to fund the agents/runners.

Sadly, there aren't many CI services with free offerings for OSS left for
us to try (that are also allowed by INFRA).


Re: CI feedback time

2021-04-15 Thread Jed Brown
Wes McKinney writes:

> I think we should take a more serious look at Buildkite for some of our CI.
>
> * First of all, it's very easy to connect self-hosted workers and
> supports ephemeral cloud workers in a way that would be difficult or
> impossible with GHA. No need to have Infra fiddle with the admin
> dashboard. So we could spin up extra workers during peak hours, or use
> autoscaling to respond to demand.
>
> * We can set up more complex / dependent job pipelines rather than the
> current GHA monolithic "long list of independent jobs" setup. For
> example, we could have a fast gatekeeper job for C++ builds (which
> lints and makes sure that everything compiles) that must pass before
> more exhaustive longer-running jobs run.

I don't have experience with Buildkite, but note that gitlab-runner is also 
lightweight and well-featured as above. Here's an example with gatekeeping 
stages across about 60 environments (mostly on-prem at multiple sites), 
including explicit "pause-for-approval" to avoid unnecessary time-consuming 
jobs.

https://gitlab.com/petsc/petsc/-/pipelines/286655535

We also use it for on-prem GPU-equipped CI with repositories hosted on GitHub, 
reporting status to PRs. The Kubernetes and docker-machine executors are 
intended for autoscaling.

https://docs.gitlab.com/runner/executors/README.html
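Both the Buildkite and the gitlab-runner suggestions rely on autoscaling ephemeral runners. Whatever executor is used, the core decision is the same: how many runners to keep alive for the current queue, within what the project can pay for. Below is a minimal sketch of such a policy. The function and its parameters are hypothetical and do not correspond to gitlab-runner or Buildkite settings:

```python
import math


def desired_runners(queued_jobs: int, running_jobs: int,
                    jobs_per_runner: int = 1,
                    min_runners: int = 0, max_runners: int = 10) -> int:
    """Return how many ephemeral runners to keep alive for the current load."""
    if jobs_per_runner < 1:
        raise ValueError("jobs_per_runner must be at least 1")
    demand = queued_jobs + running_jobs
    wanted = math.ceil(demand / jobs_per_runner)
    # The upper bound is the budget: extra demand waits in the queue instead.
    return max(min_runners, min(max_runners, wanted))
```

The cap (`max_runners`) is where the budget enters. Beyond it, jobs wait in the queue rather than spawning more machines.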


Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Thu, Apr 15, 2021 at 11:53 PM Andy Grove wrote:
>
> I started looking at Buildkite and it would solve one large problem for the
> DataFusion/Ballista project. We really need to be running integration tests
> against large data sets (such as TPC-H @ SF=100GB) and self-hosted
> Buildkite makes this simple to accomplish. I even have some modest hardware
> that I purchased specifically for this purpose, but I wasn't confident that
> I could set this up in a secure way that would protect against malicious
> code being submitted. However, if we implement the necessary GitHub hooks
We don't need additional hooks for this particular use case; see the
explanation below.
However, INFRA needs to configure webhooks for each repository we want to
receive commit events from.
For apache/arrow we have already hooked up a Buildkite instance [3]; the
same should be done for the new repositories as well.

> so that these builds only run after a committer adds an "ok to build"
> comment then I think it would be fine. This is the approach used in Apache
> Spark.
The build needs to query the pull request data from the GitHub API
(since the event payload is not available by default on Buildkite). There
is a field called author association [2] that contains the information
needed to decide whether a pull request's author is trustworthy.
We already use the same mechanism [1] to handle the comment bot
(@github-actions) requests. Therefore we don't need to explicitly mark
a PR as "ok to build", which spares a manual step.

[1]: https://github.com/apache/arrow/blob/master/dev/archery/archery/bot.py#L98
[2]: https://docs.github.com/en/graphql/reference/enums#commentauthorassociation
[3]: https://buildkite.com/apache-arrow
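Below is a minimal Python sketch of that check. It assumes the GitHub REST pull request object (`GET /repos/{owner}/{repo}/pulls/{number}`), which exposes the same information as an `author_association` string. Which associations count as trusted is an assumption here and is not taken from archery's bot [1]:

```python
import json
import urllib.request
from typing import Optional

# Assumed policy: associations treated as trusted. Other values include
# CONTRIBUTOR, FIRST_TIME_CONTRIBUTOR, FIRST_TIMER, MANNEQUIN and NONE.
TRUSTED_ASSOCIATIONS = frozenset({"OWNER", "MEMBER", "COLLABORATOR"})


def fetch_pull_request(owner: str, repo: str, number: int,
                       token: Optional[str] = None) -> dict:
    """Fetch a pull request from the GitHub REST API (needs network access)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"})
    if token:
        request.add_header("Authorization", f"token {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def author_is_trusted(pull_request: dict) -> bool:
    """Decide from the PR payload whether its author may trigger self-hosted builds."""
    association = pull_request.get("author_association", "NONE")
    return association in TRUSTED_ASSOCIATIONS
```

In a Buildkite step, the PR number is available as `BUILDKITE_PULL_REQUEST` (set to `false` for non-PR builds). A step could use it to run the expensive self-hosted jobs only when `author_is_trusted(fetch_pull_request("apache", "arrow", int(number)))` returns true.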

Re: CI feedback time

2021-04-15 Thread Andy Grove
I started looking at Buildkite and it would solve one large problem for the
DataFusion/Ballista project. We really need to be running integration tests
against large data sets (such as TPC-H @ SF=100GB), and self-hosted
Buildkite makes this simple to accomplish. I even have some modest hardware
that I purchased specifically for this purpose, but I wasn't confident that
I could set this up in a secure way that would protect against malicious
code being submitted. However, if we implement the necessary GitHub hooks
so that these builds only run after a committer adds an "ok to build"
comment, then I think it would be fine. This is the approach used in Apache
Spark.
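The "ok to build" gate could be sketched as follows: scan a PR's comments and allow the self-hosted build once a committer has posted the trigger phrase. This is not Spark's implementation. The comment shape (a `user` login and a `body`), the committer set and the phrase matching are all hypothetical simplifications:

```python
from typing import Iterable, Mapping, Set

TRIGGER_PHRASE = "ok to build"  # assumed phrase, matched case-insensitively


def build_approved(comments: Iterable[Mapping[str, str]], committers: Set[str]) -> bool:
    """Return True once any committer has posted the trigger phrase."""
    for comment in comments:
        author = comment.get("user", "")
        body = comment.get("body", "").lower()
        if author in committers and TRIGGER_PHRASE in body:
            return True
    return False
```

A real gate would also need to decide how to handle commits pushed after the approval, for example by recording the approved commit SHA.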

On Thu, Apr 15, 2021 at 3:45 PM Wes McKinney wrote:

> I think we should take a more serious look at Buildkite for some of our CI.
>
> * First of all, it's very easy to connect self-hosted workers and
> supports ephemeral cloud workers in a way that would be difficult or
> impossible with GHA. No need to have Infra fiddle with the admin
> dashboard. So we could spin up extra workers during peak hours, or use
> autoscaling to respond to demand.
>
> * We can set up more complex / dependent job pipelines rather than the
> current GHA monolithic "long list of independent jobs" setup. For
> example, we could have a fast gatekeeper job for C++ builds (which
> lints and makes sure that everything compiles) that must pass before
> more exhaustive longer-running jobs run.


Re: CI feedback time

2021-04-15 Thread Wes McKinney
I think we should take a more serious look at Buildkite for some of our CI.

* First of all, it's very easy to connect self-hosted workers, and Buildkite
supports ephemeral cloud workers in a way that would be difficult or
impossible with GHA. There is no need to have INFRA fiddle with the admin
dashboard, so we could spin up extra workers during peak hours, or use
autoscaling to respond to demand.

* We can set up more complex / dependent job pipelines rather than the
current GHA monolithic "long list of independent jobs" setup. For
example, we could have a fast gatekeeper job for C++ builds (which
lints and makes sure that everything compiles) that must pass before
more exhaustive longer-running jobs run.
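The gatekeeper idea amounts to a dependency graph in which expensive jobs only start once a cheap lint/compile job has passed. Below is a minimal sketch of that scheduling rule using Python's `graphlib` (3.9+). The job names are hypothetical. Buildkite pipelines would express the same thing declaratively (e.g. with `wait` steps or `depends_on`) rather than in code:

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set


def run_pipeline(deps: Dict[str, Set[str]],
                 run_job: Callable[[str], bool]) -> Dict[str, str]:
    """Run jobs in dependency order; skip any job whose prerequisites did not pass."""
    results: Dict[str, str] = {}
    for job in TopologicalSorter(deps).static_order():
        if all(results.get(dep) == "passed" for dep in deps.get(job, set())):
            results[job] = "passed" if run_job(job) else "failed"
        else:
            results[job] = "skipped"
    return results


# Hypothetical pipeline: a fast C++ gatekeeper guards the long-running jobs.
PIPELINE = {
    "cpp-lint-and-compile": set(),
    "cpp-tests-asan": {"cpp-lint-and-compile"},
    "python-wheels": {"cpp-lint-and-compile"},
    "integration": {"cpp-tests-asan", "python-wheels"},
}
```

If the gatekeeper fails, every downstream job is reported as skipped. That is where the runner time would be saved.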

On Thu, Apr 15, 2021 at 6:19 AM Krisztián Szűcs wrote:
>
> On Thu, Apr 15, 2021 at 2:13 AM Weston Pace wrote:
> >
> > It may be worth reaching out to the Airflow project.  Based on
> > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > it seems they have been investing time into figuring how to make
> > self-hosted runners work (it seems Github's patching model makes this
> > somewhat difficult).
>
> We tried to use github actions self hosted runners previously. Even
> though Airflow manages to harden the security issues of the self
> hosted runners (which actually affects all hosted agent based CIs like
> buildkite as well) registering and managing github agents require
> admin privileges on the repository, which we don't have.
> In order to register a github self hosted runner we need to exchange
> registration tokens with the Apache INFRA team per agent instances.
> Further issues:
> - a registration token expires in an hour
> - troubleshooting the agent<->github communication is not possible
> without involving additional INFRA roundtrips.


Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Thu, Apr 15, 2021 at 2:13 AM Weston Pace wrote:
>
> It may be worth reaching out to the Airflow project.  Based on
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> it seems they have been investing time into figuring how to make
> self-hosted runners work (it seems Github's patching model makes this
> somewhat difficult).

We tried to use GitHub Actions self-hosted runners previously. Even
though Airflow has managed to mitigate the security issues of self-hosted
runners (which actually affect all CIs based on self-hosted agents,
Buildkite included), registering and managing GitHub runners requires
admin privileges on the repository, which we don't have.
In order to register a GitHub self-hosted runner, we need to exchange
registration tokens with the Apache INFRA team for each agent instance.
Further issues:
- a registration token expires in an hour
- troubleshooting the agent<->GitHub communication is not possible
without involving additional INFRA round trips.



Re: CI feedback time

2021-04-15 Thread Krisztián Szűcs
On Thu, Apr 15, 2021 at 10:48 AM Antoine Pitrou wrote:
>
> On 15/04/2021 at 03:13, Kazuaki Ishizaki wrote:
> > As we know this is a common issue among Apache projects. While the
> > projects do not have the final solution, Apache Spark project has a
> > mechanism [1][2] to run a test in own local (forked) repository. Can we
> > alleviate the problem a little bit?
>
> Anyone can already enable AppVeyor, Travis-CI and Github Actions on
> their own fork.  There is no particular action to do here.

There is a slight but meaningful difference. The fork builds the pull
request's branch (refs/pull//head), whereas the pull request builds a
reference that GitHub creates by merging the fork's branch into the pull
request's base branch (refs/pull//merge).
If we merged based on the fork's CI status, we might see breakage on
the main branch after the merge.
This is what the Spark pull request does: it merges [1] the pull
request's branch with the pull request's base branch.

[1]: https://github.com/apache/spark/pull/29504/files#diff-48c0ee97c53013d18d6bbae44648f7fab9af2e0bf5b0dc1ca761e18ec5c478f2R99
>
> Regards
>
> Antoine.
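GitHub maintains two refs for a pull request. The first, `refs/pull/<number>/head`, is the PR branch tip, which is all a fork's CI sees. The second, `refs/pull/<number>/merge`, is GitHub's test merge into the base branch. Below is a minimal sketch of that naming, plus the commands a fork's CI could run to approximate the merge build by merging upstream's base branch into its checkout. The helper names and the throwaway committer identity are hypothetical. The sketch follows the idea of the Spark PR rather than its exact script:

```python
from typing import List


def pr_head_ref(number: int) -> str:
    """Ref for the PR's own branch tip, which is what a fork's CI builds."""
    return f"refs/pull/{number}/head"


def pr_merge_ref(number: int) -> str:
    """Ref for GitHub's test merge of the PR into its base branch."""
    return f"refs/pull/{number}/merge"


def emulate_merge_commands(upstream_url: str, base_branch: str) -> List[List[str]]:
    """Commands a fork's CI could run to test its branch merged with upstream.

    Assumes the current checkout is the PR branch.
    """
    return [
        # Fetch upstream's base branch; git records its tip as FETCH_HEAD.
        ["git", "fetch", upstream_url, base_branch],
        # Merge it into the checkout with a throwaway identity for the merge commit.
        ["git", "-c", "user.name=ci", "-c", "user.email=ci@example.invalid",
         "merge", "--no-edit", "FETCH_HEAD"],
    ]
```

Each command could be executed with `subprocess.run(cmd, check=True)` before the build itself. The fork then tests the same combination that the pull request's merge ref represents.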


Re: CI feedback time

2021-04-15 Thread Antoine Pitrou
> On 15/04/2021 at 03:13, Kazuaki Ishizaki wrote:
>
> As we know this is a common issue among Apache projects. While the
> projects do not have the final solution, Apache Spark project has a
> mechanism [1][2] to run a test in own local (forked) repository. Can we
> alleviate the problem a little bit?

Anyone can already enable AppVeyor, Travis CI and GitHub Actions on
their own fork. There is no particular action to do here.

Regards

Antoine.


Re: CI feedback time

2021-04-14 Thread Kazuaki Ishizaki
As we know, this is a common issue among Apache projects. While the
projects do not have a final solution, the Apache Spark project has a
mechanism [1][2] to run tests in the contributor's own (forked)
repository. Could we use something similar to alleviate the problem a little?

[1] https://github.com/apache/spark/pull/29504
[2] https://github.com/apache/spark-website/pull/286

Regards,

Kazuaki Ishizaki

Weston Pace wrote on 2021/04/15 09:13:05:

> From: Weston Pace
> To: dev@arrow.apache.org
> Date: 2021/04/15 09:13
> Subject: Re: CI feedback time
>
> It may be worth reaching out to the Airflow project.  Based on
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> it seems they have been investing time into figuring how to make
> self-hosted runners work (it seems Github's patching model makes this
> somewhat difficult).


Re: CI feedback time

2021-04-14 Thread Weston Pace
It may be worth reaching out to the Airflow project.  Based on
https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
it seems they have been investing time into figuring out how to make
self-hosted runners work (it seems GitHub's patching model makes this
somewhat difficult).

On Wed, Apr 14, 2021 at 12:28 PM Antoine Pitrou wrote:
>
>
> Hi Krisztian,
>
> Thanks for bringing this up.  This is definitely becoming a
> high-priority topic for Arrow development.
>
> I don't believe there is much opportunity for reducing the number of
> builds or their runtime.  We simply have a lot of development going on,
> and the number of different CI jobs we have is simply because we need to
> support many different configurations (and past experience has shown
> that they quickly stop working if we don't monitor them on a regular basis).
>
> So I think the only path forward is to build up (== buy, probably) our
> own execution resources for CI.  Whether that entails using Github
> self-hosted runners, Buildkite, or yet another system, I have no idea.
>
> I'll submit two requirements though:
> - the configuration for CI builds must be kept in the Arrow repository
>(as they are currently in .github, etc.)
> - CI builds must be runnable from PRs
>
> Regards
>
> Antoine.
>


Re: CI feedback time

2021-04-14 Thread Antoine Pitrou
Hi Krisztian,

Thanks for bringing this up.  This is definitely becoming a
high-priority topic for Arrow development.

I don't believe there is much opportunity for reducing the number of
builds or their runtime.  We simply have a lot of development going on,
and the number of different CI jobs we have is simply because we need to
support many different configurations (and past experience has shown
that they quickly stop working if we don't monitor them on a regular basis).

So I think the only path forward is to build up (== buy, probably) our
own execution resources for CI.  Whether that entails using Github
self-hosted runners, Buildkite, or yet another system, I have no idea.

I'll submit two requirements though:
- the configuration for CI builds must be kept in the Arrow repository
  (as they are currently in .github, etc.)
- CI builds must be runnable from PRs

Regards

Antoine.

> On 15/04/2021 at 00:14, Krisztián Szűcs wrote:
>
> Hi,
>
> The Apache Github Actions agent pool seems to be oversubscribed as
> more Apache projects migrate their CI setup to GHA. We experienced
> pretty solid feedback times (~20-30m) when we originally moved to GHA
> but now we are roughly 5hrs behind [1].
>
> Based on other projects' complaints and discussions [2][3] (doesn't
> have all the links at hand) we can't expect a short term solution from
> infra. I think we *need* to figure out something on the project level
> instead to maintain the overall project health and to improve the
> development velocity.
>
> I don't have a concrete proposal at the moment, but we should start to
> collect the available options. Ideas?
>
> Thanks, Krisztian
>
> [1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
> [2]: https://github.com/apache/pulsar/issues/9154
> [3]: https://issues.apache.org/jira/browse/SPARK-34053



CI feedback time

2021-04-14 Thread Krisztián Szűcs
Hi,

The Apache GitHub Actions agent pool seems to be oversubscribed as
more Apache projects migrate their CI setup to GHA. We experienced
pretty solid feedback times (~20-30m) when we originally moved to GHA,
but now we are roughly 5hrs behind [1].

Based on other projects' complaints and discussions [2][3] (I don't
have all the links at hand), we can't expect a short-term solution from
INFRA. I think we *need* to figure out something at the project level
instead, to maintain the overall project health and to improve the
development velocity.

I don't have a concrete proposal at the moment, but we should start to
collect the available options. Ideas?

Thanks, Krisztian

[1]: https://github.com/apache/arrow/actions?query=is%3Ain_progress
[2]: https://github.com/apache/pulsar/issues/9154
[3]: https://issues.apache.org/jira/browse/SPARK-34053