potiuk commented on PR #60270:
URL: https://github.com/apache/airflow/pull/60270#issuecomment-3729119122

   
   
   >  Since run_as_user is not always global(from config) and can be per dag, 
will it not be an issue with this approach?
   
   Yeah. @ephraimbuddy is very right about it.  
   
   
   > 
   > 1. What sort of issue do you foresee?
   
   I think that was mostly a comment to @amoghrajesh 's suggestion. 
   
   >> Why can we not consider initing the bundle after we know "who" will run 
the task? Maybe an early pass to detect "run_as_user" and then actually init 
the bundle with that user?
   
   The thing is that we can have celery or edge workers wher you have  multiple 
tasks run on the same machine and sharing (supervisor) processes that are long 
running. And in this case, you can have the same `bare` repository used by 
multiple tasks using different users - including those using `airflow` user. 
And in this case what happens is that all those multiple users should be able 
to execute fetch - in case they want to use commit that had not yet been 
fetched. And it will fail if they have no write access to the repo.
   
   
   > 2. How do you suggest to deal with it?
   
   I think if you consider this, I thought initially that the  "recommended" 
solution should be the one I suggested in slack initially as a workaround - 
i.e. umask  with writable access for group + all users used for impersonation 
(and airlfow user) have to share the same group. That might be of course 
relaxed if you you have a filesystem that allows to implement those permissions 
in a different way - like NFS -  but generallly our recommendation should be 
"make sure airfow and all the users used for impersonation can have write 
access to the bare git repo - this can be done for example by ... <umask + 
group>. 
   
   However after a bit of thinking - this one leaks a bit from airlfow into the 
deployment and likely makes things more complex.
   
   I think, however, there is another possibility - and maybe we could consider 
it - instead of running `git fetch` by the task, we could add extra command to 
the executin API  - so that task can ask supervisor to fetch the commit. This 
could solve all the problems and only require read permission to the bare 
repository in order to checkout the commit by task after it's been fetched by 
the supervisor. That has another advantage as well, we could potentially 
serialize those requests or batch them in supervisor and tha would avoid 
another quite likely potential issue where multiple tasks of the same dag run 
are running in parallel (which will pretty much always happen if we have mapped 
tasks) and asking to fetch the repository at the same time - this will work but 
it is not optimal - potentially many git conections opened, rate limiting might 
start playing a role etc. So leaving all the `git fetch` operations to the 
supervisor seems like a good idea.
   
   That however I think will require a bit more coupling between the bundle 
interface and execution api - because this is not exclusively a problem of 
GitBundle only - other bundles might have very similar problems and we should 
solve it in a generic way so that  there is a generic  task -> supervisor ("get 
bundle  version NNN") has to be added to the execution 
   API.
   
   But I see that as the best long-term option.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to