Just one more caveat and a few comments, so that you realise the full scope of the danger (at least as far as I know it) and can make informed decisions.
I think you should simply weigh the risks vs. the costs. As usual with security, there is never a 0-1 case; it's always a question of how much investment you can make to mitigate known risks, and what the cost of a potential "breach" would be.

Costs:

* Our solution with the patched Runner relies on periodic updates and re-releasing of the patched runner. GA has a policy that when they release a new version, the old one stops working "a few days later". This happened to us today. We do not have the process fully automated, some parts of it (building the AMI and deploying it) are done semi-automatically, we had a few days of delay, and the person doing it was on vacation... It all ended up with ~20 hours of downtime for our CI tests today :( . So if you are going the "patching" route, be prepared for some maintenance overhead and disruption. (We can probably automate more, but the nature of it, and the fact that you can test it at most once every few weeks when they release a new version, makes it "brittle".) Still, we have run it for many months, and other than occasional disruptions it looks like a "workable" solution.

Risks:

* I believe you could have the runners deployed as containers in GKE instead of VMs (to be confirmed! we never tried it), as long as you make DinD work for those containers. In fact, those are also our plans: we already secured funds from the Cloud Composer team and we are planning to run those runners in GCP. Running them as GKE containers (and killing each container after the job is done) was my thinking as well; I sketched below, after the summary, what such a single-use runner pod could look like. It might then really be possible to run an unpatched version, as far as most of the "cleaning the environment" concerns go. One of the big security concerns with the VM setup is that user A can potentially influence a follow-up build of user B in ways that might have some adverse effect. With proper containerization and one-time container use, at least that part is addressed. I might actually be happy to help (and I am quite sure other members of the Airflow CI team would be too), and we could come up with a "reusable" hosting solution that we could share between projects. Also, if you don't use GA to push any publicly available, user-facing artifacts (you should not, really), the danger is really minimal here IMHO.

* However, this in-container approach does not (fully) address several other concerns, though we are still in a much better situation than, say, in February. One of the problems is that such a setup might easily be used for crypto mining. And yes, this is a REAL concern. It's the reason why GitHub introduced the "Approve to run" button, to protect their GitHub Runners from being exploited: https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/ . Maybe this is actually quite an acceptable risk, taking into account that we could also do some monitoring and flag such cases, and that the same "Approve to run" works for self-hosted runners as well. You can also limit the CPU usage, disk usage, the total number of jobs, etc. If we can accept the risk and have some mechanism to react in case our monitoring detects an anomaly (for example, pausing the workflows or switching them to public runners temporarily), then simply monitoring for potential abuse could be enough. And keeping the capacity "limited" makes it a very poor target for the bad players; paired with "Approve to run", we should be safe.
* Yet another problem (and this one is, I think, still not fully addressed) is that when you run your builds in "pull_request_target" workflows, the jobs there might have access to "write" tokens, which can give a potential adversary uncontrolled access to your repo, packages, etc. This is actually the most dangerous part. It has been somewhat mitigated by the recent introduction of permission control (BTW, I raised a bounty on that one in December and they told me I would not get it because they "knew it" in December): https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/ (see the workflow sketch below). However, if you are using any kind of "pull_request_target" or similar workflow where you have to enable "write" access to your repo, you have to be extra careful there (and there is a risk of your mistakes being exploited in ways that might go unnoticed). Here, I think a proper and careful code review of the workflow-related parts is crucial. For example, in Airflow we run everything "substantial" during the build in Docker containers: on the GitHub Runner host we execute merely the scripts that prepare variables and build the Docker images, and then we run everything "substantial" in those containers, isolating it from the host (also sketched below). This might be a good strategy to limit the risk here.

My current thinking is that you could get a rather secure solution without patching the runner, when you combine:

* GA permission control
* GA "Approve to run"
* single-run containerisation of the runners
* further containerisation (DinD) of the substantial work in workflows that require write access
* a careful review process for workflow-related changes
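To make the single-use runner idea from the "Risks" section above a bit more concrete, here is a minimal sketch of what such a pod on GKE could look like. To be clear, this is untested (as I said, we never tried it): the runner image is a hypothetical placeholder, I assume the runner can be configured to exit after a single job, and the DinD sidecar has to run privileged, which is a known trade-off:

    # Hypothetical single-use GitHub Actions runner pod for GKE.
    # Assumes a pre-built runner image configured to exit after a
    # single job, so the pod (and anything the build left behind)
    # is thrown away afterwards.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gha-runner
    spec:
      restartPolicy: Never            # one job per pod, then recreate
      containers:
        - name: runner
          image: example.com/our-gha-runner:latest    # hypothetical image
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375             # talk to the DinD sidecar
          resources:                  # cap what a rogue job (e.g. a miner) can burn
            limits:
              cpu: "2"
              memory: 4Gi
              ephemeral-storage: 20Gi
        - name: dind
          image: docker:dind          # Docker-in-Docker sidecar
          securityContext:
            privileged: true          # required by DinD; a known trade-off
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""               # plain TCP inside the pod, for simplicity
          resources:
            limits:
              cpu: "2"
              memory: 4Gi

A small controller (or even a plain loop around kubectl) would then recreate the pod after each job finishes, and the number of such pods you allow at once is exactly the "limited capacity" I mentioned above.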
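Here is what the permission control looks like in practice. The "permissions" key is the real syntax from the changelog entry linked above; the workflow name, labels and test script are made up for illustration:

    # Example workflow with a restricted GITHUB_TOKEN.
    name: Tests
    on: [pull_request]
    permissions:
      contents: read        # the token can read the repo, and nothing else
    jobs:
      test:
        runs-on: [self-hosted]
        steps:
          - uses: actions/checkout@v2
          - run: ./scripts/run_tests.sh   # hypothetical test entrypoint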
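And a rough sketch of the "substantial work in containers" strategy. This is not our actual Airflow workflow, just the shape of it, with made-up image and script names:

    # Rough sketch: only trusted "glue" runs on the runner host.
    name: Build
    on: [pull_request_target]     # note: this event gets a privileged token!
    permissions:
      contents: read              # drop everything you do not strictly need
    jobs:
      build:
        runs-on: [self-hosted]
        steps:
          # Explicitly check out the (untrusted) code from the PR:
          - uses: actions/checkout@v2
            with:
              ref: ${{ github.event.pull_request.head.sha }}
          # Trusted glue, running directly on the runner host:
          - name: Build CI image
            run: docker build -t ci-image .      # hypothetical Dockerfile
          # Everything "substantial" (the PR code) runs inside the
          # container, without the token, isolated from the host:
          - name: Run tests in container
            run: docker run --rm ci-image ./run_tests.sh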
J.

On Wed, Aug 11, 2021 at 8:27 PM Fernando Morales Martinez <[email protected]> wrote:

> Hi team!
>
> As you may know, I've been trying to migrate the GitHub Actions runners
> over to Google Cloud.
> A few days ago, Jarek brought to my attention a few security issues (it
> looks like Jenkins doesn't suffer from these concerns) that would come up
> were I to use the version of the GA runners provided by GitHub.
>
> There is a workaround that requires using a patched version of the
> original GA runners. A caveat here is that those Actions runners need to
> be executed on VMs instead of containers.
>
> As far as I know, in Beam we've prioritized ease of contribution (i.e.
> run the tests right away) over the concern of having arbitrary code come
> from a pull request.
>
> The way I see it, we have two options here:
>
> 1.- Proceed with the use of default GA runners on Google's Kubernetes
> engine (GKE) and allow everyone that creates a pull request to run the
> Java and Python tests, or
>
> 2.- Use the patched GA runners (which allow us to specify which
> contributors/accounts can execute Actions runners) and deploy them to VMs
> on Google Compute Engine; this in case we would like to restrict access
> in the future.
>
> What are your thoughts on this?
>
> Thanks for the input!
>
> - Fer
