Just one more caveat and a few comments, so that you realise the full scope of the danger (at least as far as I know it) and can make informed decisions.
I think you should simply weigh the risks vs. the costs. As usual with security, there is never a 0-1 case; it's always a question of how much investment you can make to mitigate known risks, and what the cost of a potential "breach" would be.

Costs:

* Our solution with the patched Runner relies on periodic updates and re-releasing of the patched runner. GA has a policy that when they release a new version, the old one stops working "a few days later". This happened to us today. We do not have the process fully automated, some parts of it (building the AMI and deploying it) are done semi-automatically, we had a few days of delay, and the person doing it was on vacation... It all ended up with ~20 hours of downtime for our CI tests today :( . So if you are going the "patching" route, be prepared for some maintenance overhead and disruption. (We can probably automate more, but the nature of it, and the fact that you can test it at most once every few weeks when they release a new version, makes it "brittle".) Still, we have run it for many months, and other than occasional disruptions it looks like a "workable" solution.

Risks:

* I believe you could have the runners deployed as containers in GKE instead of VMs (to be confirmed! we never tried it), as long as you make DinD work for those containers. In fact, those are also our plans: we already secured funds from the Cloud Composer team and we are planning to run those runners in GCP. Running them as GKE containers (and killing each container after the job is done) was my thinking as well; I sketched below, after the summary, what such a single-use runner pod could look like. It might then really be possible to run an unpatched version, as far as most of the "cleaning the environment" concerns go. One of the big security concerns with the VM setup is that user A can potentially influence a follow-up build of user B in ways that might have some adverse effect. With proper containerization and one-time container use, at least that part is addressed. I might actually be happy to help (and I am quite sure other members of the Airflow CI team would be too), and we could come up with a "reusable" hosting solution that we could share between projects. Also, if you don't use GA to push any publicly available, user-facing artifacts (you should not, really), the danger is really minimal here IMHO.

* However, this in-container approach does not (fully) address several other concerns, though we are still in a much better situation than, say, in February. One of the problems is that such a setup might easily be used for crypto mining. And yes, this is a REAL concern. It's the reason why GitHub introduced the "Approve to run" button, to protect their GitHub Runners from being exploited: https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/ . Maybe this is actually quite an acceptable risk, taking into account that we could also do some monitoring and flag such cases, and that the same "Approve to run" works for self-hosted runners as well. You can also limit the CPU usage, disk usage, the total number of jobs, etc. If we can accept the risk and have some mechanism to react in case our monitoring detects an anomaly (for example, pausing the workflows or switching them to public runners temporarily), then simply monitoring for potential abuse could be enough. And keeping the capacity "limited" makes it a very poor target for the bad players; paired with "Approve to run", we should be safe.
* Yet another problem (and this one is, I think, still not fully addressed) is that when you run your builds in "pull_request_target" workflows, the jobs there might have access to "write" tokens, which can give a potential adversary uncontrolled access to your repo, packages, etc. This is actually the most dangerous part. It has been somewhat mitigated by the recent introduction of permission control (BTW, I raised a bounty on that one in December and they told me I would not get it because they "knew it" in December): https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/ (see the workflow sketch below). However, if you are using any kind of "pull_request_target" or similar workflow where you have to enable "write" access to your repo, you have to be extra careful there (and there is a risk of your mistakes being exploited in ways that might go unnoticed). Here, I think a proper and careful code review of the workflow-related parts is crucial. For example, in Airflow we run everything "substantial" during the build in Docker containers: on the GitHub Runner host we execute merely the scripts that prepare variables and build the Docker images, and then we run everything "substantial" in those containers, isolating it from the host (also sketched below). This might be a good strategy to limit the risk here.

My current thinking is that you could get a rather secure solution without patching the runner, when you combine:

* GA permission control
* GA "Approve to run"
* single-run containerisation of the runners
* further containerisation (DinD) of the substantial work in workflows that require write access
* a careful review process for workflow-related changes
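To make the single-use runner idea from the "Risks" section above a bit more concrete, here is a minimal sketch of what such a pod on GKE could look like. To be clear, this is untested (as I said, we never tried it): the runner image is a hypothetical placeholder, I assume the runner can be configured to exit after a single job, and the DinD sidecar has to run privileged, which is a known trade-off:

    # Hypothetical single-use GitHub Actions runner pod for GKE.
    # Assumes a pre-built runner image configured to exit after a
    # single job, so the pod (and anything the build left behind)
    # is thrown away afterwards.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gha-runner
    spec:
      restartPolicy: Never            # one job per pod, then recreate
      containers:
        - name: runner
          image: example.com/our-gha-runner:latest    # hypothetical image
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375             # talk to the DinD sidecar
          resources:                  # cap what a rogue job (e.g. a miner) can burn
            limits:
              cpu: "2"
              memory: 4Gi
              ephemeral-storage: 20Gi
        - name: dind
          image: docker:dind          # Docker-in-Docker sidecar
          securityContext:
            privileged: true          # required by DinD; a known trade-off
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""               # plain TCP inside the pod, for simplicity
          resources:
            limits:
              cpu: "2"
              memory: 4Gi

A small controller (or even a plain loop around kubectl) would then recreate the pod after each job finishes, and the number of such pods you allow at once is exactly the "limited capacity" I mentioned above.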
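Here is what the permission control looks like in practice. The "permissions" key is the real syntax from the changelog entry linked above; the workflow name, labels and test script are made up for illustration:

    # Example workflow with a restricted GITHUB_TOKEN.
    name: Tests
    on: [pull_request]
    permissions:
      contents: read        # the token can read the repo, and nothing else
    jobs:
      test:
        runs-on: [self-hosted]
        steps:
          - uses: actions/checkout@v2
          - run: ./scripts/run_tests.sh   # hypothetical test entrypoint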
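And a rough sketch of the "substantial work in containers" strategy. This is not our actual Airflow workflow, just the shape of it, with made-up image and script names:

    # Rough sketch: only trusted "glue" runs on the runner host.
    name: Build
    on: [pull_request_target]     # note: this event gets a privileged token!
    permissions:
      contents: read              # drop everything you do not strictly need
    jobs:
      build:
        runs-on: [self-hosted]
        steps:
          # Explicitly check out the (untrusted) code from the PR:
          - uses: actions/checkout@v2
            with:
              ref: ${{ github.event.pull_request.head.sha }}
          # Trusted glue, running directly on the runner host:
          - name: Build CI image
            run: docker build -t ci-image .      # hypothetical Dockerfile
          # Everything "substantial" (the PR code) runs inside the
          # container, without the token, isolated from the host:
          - name: Run tests in container
            run: docker run --rm ci-image ./run_tests.sh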
J.

On Wed, Aug 11, 2021 at 8:27 PM Fernando Morales Martinez <[email protected]> wrote:

> Hi team!
>
> As you may know, I've been trying to migrate the GitHub Actions runners
> over to Google Cloud.
> A few days ago, Jarek brought to my attention a few security issues (it
> looks like Jenkins doesn't suffer from these concerns) that would come up
> were I to use the version of the GA runners provided by GitHub.
>
> There is a workaround that requires using a patched version of the
> original GA runners. A caveat here is that those Actions runners need to
> be executed on VMs instead of containers.
>
> As far as I know, in Beam we've prioritized ease of contribution (i.e.
> run the tests right away) over the concern of having arbitrary code come
> from a pull request.
>
> The way I see it, we have two options here:
>
> 1.- Proceed with the use of default GA runners on Google's Kubernetes
> engine (GKE) and allow everyone that creates a pull request to run the
> Java and Python tests, or
>
> 2.- Use the patched GA runners (which allow us to specify which
> contributors/accounts can execute Actions runners) and deploy them to VMs
> on Google Compute Engine; this in case we would like to restrict access
> in the future.
>
> What are your thoughts on this?
>
> Thanks for the input!
>
> - Fer
