Yep. My thought exactly with in-memory cache.

On Fri, Jan 8, 2021 at 8:00 PM Tomasz Urbaszek <[email protected]> wrote:
> I like the idea of immutable cache. I would even be tempted to mix it with
> some form of load balancing in case of celery-like executors to optimise
> the performance. And that may also be extended to support not only venvs
> but also in-memory cache (for example for ML models) as you mentioned.
>
> T.
>
> On Fri, Jan 8, 2021 at 7:19 PM Jarek Potiuk <[email protected]> wrote:
>
>> As discussed before in the discussion. The idea is intriguing and I think
>> it opens up an easier automated management for different teams or even
>> writing isolated tasks. I have a few main concerns:
>>
>> 1) the general management (mentioned by Tomek). For Celery and Local
>> Executor several workers can co-exist and even if we "name" virtualenv as
>> proposed in option b) they can easily override each other in case several
>> tasks run on the same worker machine.
>>
>> There are many questions that need to be answered for management of the
>> venv and IMHO if we agree to some solution the design will have to
>> document behaviour of all those. I am not saying that we need to have
>> answers now to all of those but eventually the design should cover all that:
>>   - how do we update the envs when they change?
>>   - how do we know it has changed and needs an update?
>>   - do we always try to update to the latest version? Is it
>>     "eager-upgrade" or "upgrade-if-needed"?
>>   - what do we do when other tasks use the same venv while we are
>>     upgrading it? Should we wait until they finish? Or should we risk
>>     inconsistencies/failures?
>>   - do we always lock the requirements to specific versions or allow
>>     for ranges of versions?
>>   - Which versions do we lock - only the dependencies in the definition of
>>     the venv or also transitive ones (pip freeze)?
>>   - How do we cope when the version of airflow is upgraded with newer
>>     dependencies which conflict with existing venvs? Should we delete them?
>>   - How do we cope if we only add one provider in airflow which has
>>     conflicting dependencies?
>>     This will be a far more frequent event and for sure one that will be
>>     done even without restarting airflow I believe.
>>   - How do we deal with a situation when there is a conflict or when a
>>     transitive dependency introduces a new conflict and we cannot install
>>     the venv cleanly (do we revert?)
>>   - How do we deal with the situation when a transitive dependency causes
>>     failures after installation?
>>   - how do we keep track of which worker already has which version of the
>>     venv?
>>   - if/when do we delete those venvs?
>>
>> There are more questions as well on when we add multiple workers - should
>> the venv be identical on different workers, or should we allow for
>> differences? What happens when one task starts on one worker, and then is
>> retried on a different one yielding different results?
>> Finally - what happens if tasks from the past are re-executed (using
>> backfill)? Should they use the new version of the venv or the one they were
>> executed with previously? How to keep track of it?
>>
>> 2) KubernetesExecutor:
>>   - here the usefulness of optimisations is very limited IMHO if we just
>>     "install" the .venv. It will work and there is an obvious isolation
>>     (but it is obviously slower).
>>
>> I think the "naive" install-what-you-specify approach suffers from all
>> that. However I think we might approach it differently.
>>
>> My proposal to consider:
>>
>> Possibly a lot (if not all) of those problems can be solved if we
>> introduce a concept known for example from GitHub Actions (or other CI
>> systems): immutable cache. We are really talking about an immutable cache
>> of "venvs" here - no more, no less - and a way to manage them.
>> In CI systems such immutable caches are used for this very purpose -
>> storing venvs is a basic utility of those. And since those systems are
>> already used in similar environments (multi-worker setups, rerunning stuff,
>> etc.) it seems that this problem is already solved.
>>
>> - CircleCI caching: https://circleci.com/docs/2.0/caching/
>> - GA caching:
>>   https://docs.github.com/en/free-pro-team@latest/actions/guides/caching-dependencies-to-speed-up-workflows
>> - Travis CI caching: https://docs.travis-ci.com/user/caching/
>>
>> For this approach, managing venvs and other envs is largely a solved
>> problem:
>>
>> 1) you define a way to determine the unique id (usually md5 hash of files
>> contributing). In our case the hash might be based on "requirements.txt +
>> airflow version + all provider versions" to account for seamless upgrades
>> of airflow and providers.
>> 2) you have a central cache storage where you keep the cache
>> 3) cache is immutable - once you create it, it stays untouched (though it
>> allows some initial short-term duplication when several tasks create the
>> same cache - but only one succeeds)
>> 4) the only way to change the cache is to modify the sources (definition,
>> requests etc)
>> 5) cache is uploaded to a central location - this has the great benefit
>> of optimising the speed of venv creation across multiple workers AND
>> including KubernetesExecutor as well!
>> 6) this solution can then be applied to anything:
>>   - caching .npm if someone wants to run their javascript
>>   - preparing and caching machine learning models
>>   - caching .maven deps for java
>>   .....
>>
>> I believe while answering the questions above, we will have to figure out
>> something very similar to this pattern, so IMHO, this is a great
>> opportunity to 'stand on the shoulders of giants' and rather than
>> reinventing the wheel, we should simply implement such an immutable cache
>> and use it for .venv.
>>
>> J.
>>
>> On Fri, Jan 8, 2021 at 6:20 PM Martineau, Constance <[email protected]>
>> wrote:
>>
>>> This is a problem I have been grappling with for some time, and am
>>> intrigued by your proposal!
>>> For departments that are under the same
>>> "umbrella" department but are comprised of many different decentralized
>>> smaller groups, the ability to see and interact with other people's dags
>>> via the UI - while also managing their own dependencies - would be a
>>> feature, not a bug. The Airflow UI, with descriptive dag and task ids, is
>>> a visual cue for the subgroups to speak with one another to leverage each
>>> other's work as a base instead of reinventing the wheel.
>>>
>>> How would this work if one were using the KubernetesExecutor?
>>>
>>> Cheers,
>>> Constance
>>>
>>> / CONSTANCE MARTINEAU | Principal Developer, Platforms and
>>> Operations | Tel: 514-847-7992 | [email protected]
>>>
>>> -----Original Message-----
>>> From: Tomasz Urbaszek <[email protected]>
>>> Sent: Friday, January 8, 2021 11:48 AM
>>> To: [email protected]
>>> Subject: Re: [DISCUSS] [AIP-37] Virtualenv management inside Airflow
>>>
>>> Thanks Zacharya! I think this is an interesting idea. It seems to be a
>>> simple way to add "multi tenancy" to Airflow. However, I'm afraid that
>>> having separate venvs for separate tasks (=teams) solves only the
>>> dependency management problem, not the problem of managing multiple teams
>>> using the same cluster (different teams still can view, delete, update
>>> DAGs of other teams and they share the same db). As far as I know there's
>>> no "production grade" way to deploy a single Airflow instance that can be
>>> shared by teams. Although I know there are ways to achieve that using a
>>> GKE cluster with multiple instances of Airflow.
>>>
>>> Now, focusing on your proposition. If we decide that we want to
>>> support such a feature we will need to think about a retention policy for
>>> the venvs. Having multiple venvs will increase disk/memory allocation and
>>> we should try to remove venvs that are no longer used.
>>> In general we will need a mechanism to manage those venvs (define,
>>> create, update, delete).
>>>
>>> That said I'm not convinced that Airflow should take care of its
>>> environment/deployment. In my opinion it is the users' task to make sure
>>> that their tasks are executed in the right environment. And I think this is
>>> easily achievable with multiple workers/queues and docker images that can
>>> be built on CI/CD systems.
>>>
>>> Cheers,
>>> Tomek
>>>
>>> On Fri, Jan 8, 2021 at 5:26 PM Zacharya Haitin <[email protected]>
>>> wrote:
>>> >
>>> > Hi everyone,
>>> >
>>> > Currently, there is no easy way to manage an Airflow cluster that
>>> > contains multiple teams and python based DAGS.
>>> > I submitted the following AIP with a suggestion on how to make the venv
>>> > management part of the Airflow executors lifecycle and would love to get
>>> > your feedback.
>>> >
>>> > My suggestion should be pretty easy to implement and will help the
>>> > users with python packages deployments.
>>> >
>>> > AIP:
>>> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow
>>> > Issue:
>>> > https://github.com/apache/airflow/issues/13364
>>> >
>>> > Please let me know if you have any questions or suggestions on how to
>>> > improve this solution, or if you disagree with my approach.
>>> >
>>> > Thanks,
>>> > Zacharya.
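
For reference, points 1) to 4) of the immutable cache proposal quoted above (a unique id derived from "requirements.txt + airflow version + all provider versions", and a cache entry that is created once and never modified) could be sketched roughly as below. This is only an illustration under stated assumptions and not part of AIP-37: the names venv_cache_key and publish_once are hypothetical, a local directory (cache_root) stands in for the central cache storage, and SHA-256 is used in place of the md5 mentioned in the thread.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Mapping


def venv_cache_key(requirements: str, airflow_version: str,
                   provider_versions: Mapping[str, str]) -> str:
    """Derive a cache id from everything that should invalidate the venv."""
    digest = hashlib.sha256()  # assumption: SHA-256 instead of md5
    # Ignore blank lines and ordering so cosmetic edits keep the same key.
    lines = sorted(l.strip() for l in requirements.splitlines() if l.strip())
    digest.update("\n".join(lines).encode("utf-8"))
    digest.update(f"\x00airflow=={airflow_version}".encode("utf-8"))
    for name in sorted(provider_versions):
        digest.update(f"\x00{name}=={provider_versions[name]}".encode("utf-8"))
    return digest.hexdigest()


def publish_once(cache_root: Path, key: str,
                 build: Callable[[Path], None]) -> Path:
    """Create the entry for `key` if it is missing; never touch an existing one."""
    target = cache_root / key
    if target.exists():
        return target  # immutable: reuse the entry as-is
    cache_root.mkdir(parents=True, exist_ok=True)
    # Build in a private staging directory so readers never see a partial entry.
    staging = Path(tempfile.mkdtemp(dir=cache_root, prefix=".staging-"))
    build(staging)  # hypothetical callback, e.g. one that creates the venv
    try:
        # Renaming onto an existing non-empty directory fails, so only one
        # of several concurrent builders ends up publishing its result.
        os.rename(staging, target)
    except OSError:
        shutil.rmtree(staging, ignore_errors=True)  # lost the race; discard
    return target
```

With this shape, a changed requirement, Airflow version or provider version yields a new key and therefore a new entry, and old entries stay untouched. That would leave retention (when to delete unused entries) as a separate question, as raised earlier in the thread.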
