potiuk commented on PR #60270:
URL: https://github.com/apache/airflow/pull/60270#issuecomment-3729119122
> Since run_as_user is not always global(from config) and can be per dag,
will it not be an issue with this approach?
Yeah. @ephraimbuddy is very right about it.
>
> 1. What sort of issue do you foresee?
I think that was mostly a comment to @amoghrajesh 's suggestion.
>> Why can we not consider initing the bundle after we know "who" will run
the task? Maybe an early pass to detect "run_as_user" and then actually init
the bundle with that user?
The thing is that we can have celery or edge workers wher you have multiple
tasks run on the same machine and sharing (supervisor) processes that are long
running. And in this case, you can have the same `bare` repository used by
multiple tasks using different users - including those using `airflow` user.
And in this case what happens is that all those multiple users should be able
to execute fetch - in case they want to use commit that had not yet been
fetched. And it will fail if they have no write access to the repo.
> 2. How do you suggest to deal with it?
I think if you consider this, I thought initially that the "recommended"
solution should be the one I suggested in slack initially as a workaround -
i.e. umask with writable access for group + all users used for impersonation
(and airlfow user) have to share the same group. That might be of course
relaxed if you you have a filesystem that allows to implement those permissions
in a different way - like NFS - but generallly our recommendation should be
"make sure airfow and all the users used for impersonation can have write
access to the bare git repo - this can be done for example by ... <umask +
group>.
However after a bit of thinking - this one leaks a bit from airlfow into the
deployment and likely makes things more complex.
I think, however, there is another possibility - and maybe we could consider
it - instead of running `git fetch` by the task, we could add extra command to
the executin API - so that task can ask supervisor to fetch the commit. This
could solve all the problems and only require read permission to the bare
repository in order to checkout the commit by task after it's been fetched by
the supervisor. That has another advantage as well, we could potentially
serialize those requests or batch them in supervisor and tha would avoid
another quite likely potential issue where multiple tasks of the same dag run
are running in parallel (which will pretty much always happen if we have mapped
tasks) and asking to fetch the repository at the same time - this will work but
it is not optimal - potentially many git conections opened, rate limiting might
start playing a role etc. So leaving all the `git fetch` operations to the
supervisor seems like a good idea.
That however I think will require a bit more coupling between the bundle
interface and execution api - because this is not exclusively a problem of
GitBundle only - other bundles might have very similar problems and we should
solve it in a generic way so that there is a generic task -> supervisor ("get
bundle version NNN") has to be added to the execution
API.
But I see that as the best long-term option.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]