I like the idea of an immutable cache. I would even be tempted to combine it with some form of load balancing for Celery-like executors to optimise performance. It could also be extended to support not only venvs but also in-memory caches (for example for ML models), as you mentioned.
T.

On Fri, Jan 8, 2021 at 7:19 PM Jarek Potiuk <[email protected]> wrote:

> As discussed before. The idea is intriguing and I think it opens up easier automated management for different teams, or even for writing isolated tasks. I have a few main concerns:
>
> 1) The general management (mentioned by Tomek). For Celery and Local Executor several workers can co-exist, and even if we "name" the virtualenv as proposed in option b) they can easily override each other in case several tasks run on the same worker machine.
>
> There are many questions that need to be answered for management of the venv, and IMHO if we agree to some solution the design will have to document the behaviour of all those. I am not saying that we need to have answers to all of those now, but eventually the design should cover all that:
>   - How do we update the envs when they change?
>   - How do we know an env has changed and needs an update?
>   - Do we always try to update to the latest version? Is it "eager-upgrade" or "upgrade-if-needed"?
>   - What do we do when other tasks use the same venv while we are upgrading it? Should we wait until they finish? Or should we risk inconsistencies/failures?
>   - Do we always lock the requirements to specific versions or allow for ranges of versions?
>   - Which versions do we lock - only the dependencies in the definition of the venv or also transitive ones (pip freeze)?
>   - How do we cope when the version of Airflow is upgraded with newer dependencies which conflict with existing venvs? Should we delete them?
>   - How do we cope if we only add one provider in Airflow which has conflicting dependencies? This will be a far more frequent event, and for sure one that will be done even without restarting Airflow, I believe.
>   - How do we deal with a situation when there is a conflict, or when a transitive dependency introduces a new conflict and we cannot install the venv cleanly (do we revert?)
>   - How do we deal with the situation when a transitive dependency causes failures after installation?
>   - How do we keep track of which worker already has which version of the venv?
>   - If/when do we delete those venvs?
>
> There are more questions as well when we add multiple workers - should the venv be identical on different workers, or should we allow for differences? What happens when one task starts on one worker and is then retried on a different one, yielding different results?
> Finally - what happens if tasks from the past are re-executed (using backfill)? Should they use the new version of the venv or the one they were executed with previously? How do we keep track of it?
>
> 2) KubernetesExecutor:
>   - Here the usefulness of optimisations is very limited IMHO if we just "install" the .venv. It will work and there is an obvious isolation (but it is obviously slower).
>
> I think the "naive" install-what-you-specify approach suffers from all that. However, I think we might approach it differently.
>
> My proposal to consider:
>
> Possibly a lot (if not all) of those problems can be solved if we introduce a concept known for example from GitHub Actions (or other CI systems) - an immutable cache. We are really talking about an immutable cache of "venvs" here - no more, no less - and a way to manage them.
> In CI systems such immutable caches are used for this very purpose - storing venvs is a basic utility of those. And since those systems are already used in similar environments (multi-worker setups, rerunning stuff, etc.) it seems that this problem is already solved.
>
>   - CircleCI caching: https://circleci.com/docs/2.0/caching/
>   - GitHub Actions caching: https://docs.github.com/en/free-pro-team@latest/actions/guides/caching-dependencies-to-speed-up-workflows
>   - Travis CI caching: https://docs.travis-ci.com/user/caching/
>
> For this approach, managing venvs and other envs is largely a solved problem:
>
> 1) You define a way to determine the unique id (usually an md5 hash of the contributing files). In our case the hash might be based on "requirements.txt + airflow version + all provider versions" to account for seamless upgrades of Airflow and providers.
> 2) You have a central cache storage where you keep the cache.
> 3) The cache is immutable - once you create it, it stays untouched (though it allows some initial short-term duplication when several tasks create the same cache - but only one succeeds).
> 4) The only way to change the cache is to modify the sources (definition, requests etc).
> 5) The cache is uploaded to a central location - this has the great benefit of optimising the speed of venv creation across multiple workers AND includes KubernetesExecutor as well!
> 6) This solution can then be applied to anything:
>   - caching .npm if someone wants to run their JavaScript
>   - preparing and caching machine learning models
>   - caching .maven deps for Java
>   .....
>
> I believe that while answering the questions above we will have to figure out something very similar to this pattern, so IMHO this is a great opportunity to "stand on the shoulders of giants": rather than reinventing the wheel, we should simply implement such an immutable cache and use it for .venv.
>
> J.
>
> On Fri, Jan 8, 2021 at 6:20 PM Martineau, Constance <[email protected]> wrote:
>
>> This is a problem I have been grappling with for some time, and am intrigued by your proposal!
>> For departments that are under the same "umbrella" department but are comprised of many different decentralized smaller groups, the ability to see and interact with other people's dags via the UI - while also managing their own dependencies - would be a feature, not a bug. The Airflow UI, with descriptive dag and task ids, is a visual cue for the subgroups to speak with one another and leverage each other's work as a base instead of reinventing the wheel.
>>
>> How would this work if one were using the KubernetesExecutor?
>>
>> Cheers,
>> Constance
>>
>> / CONSTANCE MARTINEAU | Développeuse Principale, Platformes et Exploitation | Tél: 514-847-7992 | [email protected]
>>
>> -----Original Message-----
>> From: Tomasz Urbaszek <[email protected]>
>> Sent: Friday, January 8, 2021 11:48 AM
>> To: [email protected]
>> Subject: Re: [DISCUSS] [AIP-37] Virtualenv management inside Airflow
>>
>> Thanks Zacharya! I think this is an interesting idea. It seems to be a simple way to add "multi tenancy" to Airflow. However, I'm afraid that having separate venvs for separate tasks (=teams) solves only the dependency management problem, not the problem of managing multiple teams using the same cluster (different teams can still view, delete and update DAGs of other teams, and they share the same db). As far as I know there's no "production grade" way to deploy a single Airflow instance that can be shared by teams, although I know there are ways to achieve that using a GKE cluster with multiple instances of Airflow.
>>
>> Now, focusing on your proposition. If we decide that we want to support such a feature, we will need to think about a retention policy for the venvs. Having multiple venvs will increase disk/memory allocation and we should try to remove venvs that are no longer used.
>> In general we will need a mechanism to manage those venvs (define, create, update, delete).
>>
>> That said, I'm not convinced that Airflow should take care of its environment/deployment. In my opinion it is the users' task to make sure that their tasks are executed in the right environment. And I think this is easily achievable with multiple workers/queues and Docker images that can be built on CI/CD systems.
>>
>> Cheers,
>> Tomek
>>
>> On Fri, Jan 8, 2021 at 5:26 PM Zacharya Haitin <[email protected]> wrote:
>> >
>> > Hi everyone,
>> >
>> > Currently, there is no easy way to manage an Airflow cluster that contains multiple teams and Python-based DAGs.
>> > I submitted the following AIP with a suggestion for how to make venv management part of the Airflow executors' lifecycle and would love to get your feedback.
>> >
>> > My suggestion should be pretty easy to implement and will help users with Python package deployments.
>> >
>> > AIP:
>> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow
>> > Issue:
>> > https://github.com/apache/airflow/issues/13364
>> >
>> > Please let me know if you have any questions or suggestions on how to improve this solution, or if you disagree
>> > with my approach.
>> >
>> > Thanks,
>> > Zacharya.
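
Steps 1) to 3) of the immutable-cache proposal quoted above can be sketched roughly as follows. This is only an illustration under assumptions, not anything taken from AIP-37 or from Airflow itself: the names `venv_cache_key` and `publish_once` are hypothetical, a local directory stands in for the "central cache storage", SHA-256 is used in place of the md5 mentioned in the thread (any stable hash would do), and the "only one succeeds" behaviour relies on `os.rename` refusing to overwrite an existing non-empty directory.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def venv_cache_key(requirements: str, airflow_version: str,
                   provider_versions: Dict[str, str]) -> str:
    """Hash everything that defines the venv into one deterministic id (hypothetical helper)."""
    # Strip and sort requirement lines so ordering and blank lines do not change the key.
    parts = sorted(ln.strip() for ln in requirements.splitlines() if ln.strip())
    parts.append(f"airflow=={airflow_version}")
    parts.extend(f"provider:{name}=={provider_versions[name]}"
                 for name in sorted(provider_versions))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def publish_once(cache_root: Path, key: str,
                 build: Callable[[Path], None]) -> Path:
    """Return the cache entry for `key`, building it only if it does not exist yet."""
    target = cache_root / key
    if target.exists():
        return target  # cache hit: existing entries are never modified
    cache_root.mkdir(parents=True, exist_ok=True)
    # Build in a private staging directory, then move it into place in one step.
    staging = Path(tempfile.mkdtemp(prefix=".staging-", dir=cache_root))
    try:
        build(staging)
        os.rename(staging, target)
    except OSError:
        # If a concurrent builder published first, keep its entry; otherwise re-raise.
        if not target.exists():
            raise
    finally:
        # No-op after a successful rename; removes our copy if we lost the race.
        shutil.rmtree(staging, ignore_errors=True)
    return target
```

A worker would call `publish_once(root, venv_cache_key(...), build)` with a `build` callable that creates the venv in the directory it is given. Changing the requirements, the Airflow version or any provider version yields a new key and therefore a new entry, which is how "the only way to change the cache is to modify the sources" falls out of the scheme.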
