Hi team!
Sorry it took so long to come back to this one. I just pushed code to
this branch <https://github.com/fernando-wizeline/beam/tree/BEAM-12812> that
takes care of the following:
1.- Creates a Docker image and entry point for installing and registering
the self-hosted runners.
2.- Adds a GKE deployment configuration for the GitHub Actions runners.
3.- Sets all the workflows to run on the self-hosted runners (a rough sketch
of what that change looks like is below).
4.- Builds the container image and pushes it to gcr.io.
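
To make item 3 concrete: the per-workflow change is essentially the runs-on
key. A minimal sketch (the workflow name, labels and steps here are
illustrative, not copied from the branch):

# .github/workflows/example_tests.yml -- illustrative sketch only
name: Example tests
on: [push, pull_request]
jobs:
  tests:
    # was: runs-on: ubuntu-latest
    runs-on: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v2
      - run: ./gradlew check  # placeholder for the actual test task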

As for the "approve to run" approach, my understanding is that it is
already enabled by default for everyone except existing contributors; is
that correct?

I'm still figuring out how to implement Docker-in-Docker (DinD) for the
project.
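
The pattern I'm currently evaluating (not in the branch yet, so treat this as
an assumption on my part) is a privileged docker:dind sidecar next to the
runner container, with the runner pointed at it over localhost; roughly:

# Sketch of the runner Deployment with a DinD sidecar -- the image name and
# labels are hypothetical, not the ones in the branch.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gha-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gha-runner
  template:
    metadata:
      labels:
        app: gha-runner
    spec:
      containers:
        - name: runner
          image: gcr.io/some-project/gha-runner:latest  # hypothetical image
          env:
            - name: DOCKER_HOST
              value: tcp://localhost:2375  # talk to the sidecar's daemon
        - name: dind
          image: docker:dind
          securityContext:
            privileged: true  # DinD needs a privileged container
          env:
            - name: DOCKER_TLS_CERTDIR
              value: ""  # disable TLS so the daemon listens on plain 2375

The privileged security context is what DinD requires on GKE, and that is the
part I'm still evaluating.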

Anyway, I figured that reviewing the changes in the branch
<https://github.com/fernando-wizeline/beam/tree/BEAM-12812> could be a good
first step toward resuming the work on this.

Please let me know what you think.

- Fernando Morales


On Mon, Aug 30, 2021 at 11:07 AM Ahmet Altay <[email protected]> wrote:

> Thank you Fernando!
>
> On Mon, Aug 30, 2021 at 10:00 AM Fernando Morales Martinez <
> [email protected]> wrote:
>
>> Hi Ahmet!
>> I'm working on a draft to share with Jarek and the rest of the team.
>> Will send it today before EOD to this thread.
>> Thanks for the help!
>>
>> On Mon, Aug 30, 2021 at 10:54 AM Ahmet Altay <[email protected]> wrote:
>>
>>> Fernando, just checking. Are you still blocked on this? Do you have
>>> other questions?
>>>
>>> On Thu, Aug 19, 2021 at 4:41 PM Ahmet Altay <[email protected]> wrote:
>>>
>>>> Thank you Fernando, Jarek for the detailed information.
>>>>
>>>> I do not have much experience related to abuse; however, based on what I
>>>> am reading, my suggestion would be to avoid using patched GA runners.
>>>> Patched images would require ongoing maintenance and would easily put us
>>>> at risk of unplanned outages. I would suggest going down the path of using
>>>> the default GA runners on GKE and securing them as much as possible by
>>>> implementing Jarek's suggestions (if not all of them, then whatever is
>>>> feasible).
>>>>
>>>> Fernando, does that answer your question?
>>>>
>>>> Ahmet
>>>>
>>>> On Tue, Aug 17, 2021 at 3:10 PM Kiley Sok <[email protected]> wrote:
>>>>
>>>>> +Kenneth Knowles <[email protected]> +Lukasz Cwik <[email protected]>
>>>>>   Thoughts?
>>>>> cc: +Ahmet Altay <[email protected]>
>>>>>
>>>>> More context in the previous email thread
>>>>>
>>>>> https://lists.apache.org/thread.html/r7fb0f8ec9042adacca8dae622a807a138e87d6656a7049cb59874c9e%40%3Cdev.beam.apache.org%3E
>>>>>
>>>>> On Wed, Aug 11, 2021 at 12:36 PM Jarek Potiuk <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Just one more caveat and a few comments, so that you realise the full
>>>>>> scope of the danger (at least as far as I know it) and can make informed
>>>>>> decisions.
>>>>>>
>>>>>> I think you should simply weigh the risks vs. the costs. As usual with
>>>>>> security, it is never a 0-1 case; it's always about how much you can
>>>>>> invest to mitigate the known risks and what the cost of a potential
>>>>>> "breach" would be.
>>>>>>
>>>>>> Costs:
>>>>>>
>>>>>> * our solution with the patched runner relies on periodically updating
>>>>>> and re-releasing the patched runner. GA has a policy that when they
>>>>>> release a new runner version, the old one stops working "a few days
>>>>>> later". This happened to us today. We do not have the process fully
>>>>>> automated and some parts of it (building the AMI and deploying it) are
>>>>>> done semi-automatically. We had a few days of delay and the person doing
>>>>>> it was on vacation... It all ended up with ~20 hours of downtime for our
>>>>>> CI tests today :(. So if you are going the "patching" route, be prepared
>>>>>> for some maintenance overhead and disruption. (We can probably automate
>>>>>> more, but the nature of it, and the fact that you can test it at most
>>>>>> once every few weeks when they release a new version, makes it
>>>>>> "brittle".) But we have run it for many months and, other than
>>>>>> occasional disruptions, it looks like a "workable" solution.
>>>>>>
>>>>>> Risks:
>>>>>>
>>>>>> * I believe you could deploy the runners in GKE as containers instead
>>>>>> of VMs (to be confirmed! we never tried it), as long as you get DinD
>>>>>> working inside those containers. In fact, those are our plans as well.
>>>>>> We already secured funds from the Cloud Composer team and we are
>>>>>> planning to run those runners in GCP. Running them as GKE containers
>>>>>> (and killing each container after the job is done) was my thinking too.
>>>>>> With that, it might really be possible to run an unpatched version as
>>>>>> far as most of the "cleaning the environment" concerns go. One of the
>>>>>> big security concerns with the VM setup is that user A can potentially
>>>>>> influence a follow-up build of user B in ways that might have some
>>>>>> adverse effect. With proper containerization and one-time container use,
>>>>>> at least that part is addressed.
>>>>>> I might actually be happy to help (and I am quite sure other members of
>>>>>> the Airflow CI team would be too), and we could come up with a
>>>>>> "reusable" hosting solution that we could share between projects. Also,
>>>>>> if you don't use GA to push any publicly available, user-facing
>>>>>> artifacts (you really should not), the danger here is minimal IMHO.
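>>>>>>
>>>>>> For the "one container per job" idea, the rough shape (an untested
>>>>>> sketch, assuming the runner entrypoint registers itself, picks up a
>>>>>> single job, deregisters and exits; something still has to create a
>>>>>> fresh Job each time) could be:
>>>>>>
>>>>>> # One-shot runner pod -- image and secret names are made up for the sketch
>>>>>> apiVersion: batch/v1
>>>>>> kind: Job
>>>>>> metadata:
>>>>>>   generateName: gha-runner-   # created per run, so each Job gets a unique name
>>>>>> spec:
>>>>>>   template:
>>>>>>     spec:
>>>>>>       restartPolicy: Never    # never reuse a pod for a second build
>>>>>>       containers:
>>>>>>         - name: runner
>>>>>>           image: gcr.io/some-project/gha-runner:latest
>>>>>>           env:
>>>>>>             - name: RUNNER_TOKEN   # registration token, e.g. from a Secret
>>>>>>               valueFrom:
>>>>>>                 secretKeyRef:
>>>>>>                   name: gha-runner-token
>>>>>>                   key: token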
>>>>>>
>>>>>> * however, this in-container approach does not (fully) address several
>>>>>> other concerns - although we are still in a much better situation than,
>>>>>> say, in February. One of the problems is that such a setup might easily
>>>>>> be used for crypto mining. And yes, this is a REAL concern. It's the
>>>>>> reason why GitHub introduced the "approve to run" button to protect
>>>>>> their own GitHub runners from being exploited
>>>>>> https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/
>>>>>> . Maybe this is actually quite an acceptable risk, taking into account
>>>>>> that we could do some monitoring and flag such cases, and the same
>>>>>> "Approve to run" mechanism works for self-hosted runners as well. You
>>>>>> can also limit CPU and disk usage, cap the total number of jobs, etc. If
>>>>>> we can accept the risk and have some mechanism to react when our
>>>>>> monitoring detects an anomaly (for example, pausing the workflows or
>>>>>> switching them temporarily to the public runners), then just monitoring
>>>>>> for potential abuse could be enough. And simply keeping the capacity
>>>>>> "limited" makes it a very poor target for bad actors - paired with
>>>>>> "Approve to run", we should be safe.
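>>>>>>
>>>>>> To make "limit CPU and disk usage" concrete: on GKE this is just the
>>>>>> usual resource caps on the runner container (the numbers below are
>>>>>> made-up placeholders, not a recommendation):
>>>>>>
>>>>>> # Per-runner resource caps -- values are illustrative only
>>>>>> resources:
>>>>>>   requests:
>>>>>>     cpu: "1"
>>>>>>     memory: 4Gi
>>>>>>   limits:
>>>>>>     cpu: "2"                      # hard CPU cap per runner pod
>>>>>>     memory: 8Gi
>>>>>>     ephemeral-storage: 20Gi       # caps scratch disk usage on the node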
>>>>>>
>>>>>> * yet another problem (and this one is, I think, still not fully
>>>>>> addressed) is that when you run your builds in "pull_request_target"
>>>>>> workflows, the jobs there might have access to "write" tokens that can
>>>>>> give a potential adversary uncontrolled access to your repo, packages,
>>>>>> etc. This is actually the most dangerous part. It has been somewhat
>>>>>> mitigated by the recent introduction of permission control (BTW, I
>>>>>> raised a bounty on that one in December and was told I would not get it
>>>>>> because they already "knew about it"):
>>>>>> https://github.blog/changelog/2021-04-20-github-actions-control-permissions-for-github_token/.
>>>>>> However, if you are using "pull_request_target" or any similar workflow
>>>>>> where you have to enable "write" access to your repo, you have to be
>>>>>> extra careful (and there is a risk that mistakes there get exploited in
>>>>>> ways that go unnoticed). Here, I think proper and careful code review of
>>>>>> the workflow-related parts is crucial. For example, in Airflow we run
>>>>>> everything "substantial" during the build in Docker containers - on the
>>>>>> GitHub runner host we merely execute scripts that prepare variables and
>>>>>> build the Docker images, and then run everything "substantial" inside
>>>>>> those containers, isolating it from the host. This might be a good
>>>>>> strategy to limit the risk here.
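>>>>>>
>>>>>> The permission control mentioned above is the workflow-level
>>>>>> "permissions" key; for example, a workflow that only needs to read the
>>>>>> repo can drop everything else from GITHUB_TOKEN (a sketch, not taken
>>>>>> from Beam's or Airflow's workflows):
>>>>>>
>>>>>> # Top-level or per-job in the workflow file
>>>>>> permissions:
>>>>>>   contents: read      # checkout only
>>>>>>   pull-requests: read
>>>>>>   # any scope not listed here (packages, deployments, etc.) gets no access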
>>>>>>
>>>>>> My current thinking is that you could get a rather secure solution
>>>>>> without patching the runner when you combine:
>>>>>>
>>>>>> * GA permission control
>>>>>> * GA "approve to run"
>>>>>> * single-run containerisation of the runners
>>>>>> * further containerisation (DinD) of the substantial work in workflows
>>>>>> that require write access
>>>>>> * a careful review process for workflow-related changes
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 11, 2021 at 8:27 PM Fernando Morales Martinez <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi team!
>>>>>>>
>>>>>>> As you may know, I've been trying to migrate the GitHub Actions
>>>>>>> runners over to Google Cloud.
>>>>>>> A few days ago, Jarek brought to my attention a few security issues
>>>>>>> (it looks like Jenkins doesn't suffer from these concerns) that would
>>>>>>> come up were I to use the version of the GA runners provided by GitHub.
>>>>>>>
>>>>>>> There is a workaround that requires using a patched version of the
>>>>>>> original GA runners. A caveat here is that those Actions runners need
>>>>>>> to be executed on VMs instead of containers.
>>>>>>>
>>>>>>> As far as I know, in Beam we've prioritized ease of contribution
>>>>>>> (i.e., running the tests right away) over the concern of arbitrary
>>>>>>> code coming in from a pull request.
>>>>>>>
>>>>>>> The way I see it, we have two options here:
>>>>>>>
>>>>>>> 1.- Proceed with the default GA runners on Google Kubernetes Engine
>>>>>>> (GKE) and allow everyone who creates a pull request to run the Java
>>>>>>> and Python tests, or
>>>>>>>
>>>>>>> 2.- Use the patched GA runners (which allow us to specify which
>>>>>>> contributors/accounts can execute Actions runners) and deploy them to
>>>>>>> VMs on Google Compute Engine, in case we would like to restrict access
>>>>>>> in the future.
>>>>>>>
>>>>>>> What are your thoughts on this?
>>>>>>>
>>>>>>> Thanks for the input!
>>>>>>>
>>>>>>> - Fer
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> +48 660 796 129
>>>>>>
>>>>>
>
>
