I like the idea of an immutable cache. I would even be tempted to combine it
with some form of load balancing for Celery-like executors to optimise
performance. It could also be extended to support not only venvs
but also an in-memory cache (for example for ML models), as you mentioned.
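
A rough sketch of what I mean by mixing the cache with load balancing:
prefer a worker that already holds the cache entry and fall back to the
least loaded one. Everything here is hypothetical (the pick_worker name and
the shape of the workers mapping are made up for illustration, and this is
not an existing Celery or Airflow API).

```python
def pick_worker(cache_key, workers):
    """Pick a worker name for a task that needs the cache entry `cache_key`.

    `workers` maps a worker name to a dict with two illustrative fields:
    "cached" (set of cache keys present on that worker) and
    "running" (number of tasks currently running there).
    """
    if not workers:
        raise ValueError("no workers available")

    def rank(name):
        info = workers[name]
        has_cache = cache_key in info["cached"]
        # Sort order: workers holding the cache first, then the least loaded,
        # then by name so the choice is deterministic.
        return (not has_cache, info["running"], name)

    return min(workers, key=rank)
```

A worker with a warm cache wins even if it is busier than a cold one; whether
that trade-off is right would depend on how expensive the venv is to build.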

T.

On Fri, Jan 8, 2021 at 7:19 PM Jarek Potiuk <[email protected]> wrote:

> As discussed before, the idea is intriguing and I think
> it opens up easier automated management for different teams, or even for
> writing isolated tasks. I have a few main concerns:
>
> 1) The general management (mentioned by Tomek). For the Celery and Local
> Executors several workers can co-exist, and even if we "name" the virtualenv
> as proposed in option b), they can easily overwrite each other when several
> tasks run on the same worker machine.
>
> There are many questions that need to be answered about managing the
> venvs, and IMHO if we agree on a solution the design will have to
> document the behaviour for all of them. I am not saying that we need
> answers to all of these now, but eventually the design should cover them:
>     - How do we update the envs when they change?
>     - How do we know an env has changed and needs an update?
>     - Do we always try to update to the latest version? Is it
> "eager-upgrade" or "upgrade-if-needed"?
>     - What do we do when other tasks use the same venv while we are
> upgrading it? Should we wait until they finish, or should we risk
> inconsistencies/failures?
>     - Do we always lock the requirements to specific versions, or allow
> ranges of versions?
>     - Which versions do we lock: only the dependencies in the definition of
> the venv, or also the transitive ones (pip freeze)?
>     - How do we cope when Airflow is upgraded with newer
> dependencies that conflict with existing venvs? Should we delete them?
>     - How do we cope if we only add one provider to Airflow which has
> conflicting dependencies? This will be a far more frequent event, and
> surely one that will be done even without restarting Airflow, I believe.
>     - How do we deal with a situation where there is a conflict, or where a
> transitive dependency introduces a new conflict and we cannot install the
> venv cleanly (do we revert?)
>     - How do we deal with a situation where a transitive dependency causes
> failures after installation?
>     - How do we keep track of which worker already has which version of the
> venv?
>     - If and when do we delete those venvs?
>
> There are more questions when we add multiple workers: should
> the venv be identical on different workers, or should we allow for
> differences? What happens when a task starts on one worker and is then
> retried on a different one, yielding different results?
> Finally, what happens if tasks from the past are re-executed (using
> backfill)? Should they use the new version of the venv or the one they were
> executed with previously? How do we keep track of it?
>
> 2) KubernetesExecutor:
> - Here the usefulness of optimisations is very limited IMHO if we just
> "install" the venv. It will work and there is obvious isolation (but it
> is obviously slower).
>
> I think the "naive" install-what-you-specify approach suffers from all
> that. However I think we might approach it differently.
>
> My proposal to consider:
>
>
>
> Possibly a lot (if not all) of those problems can be solved if we
> introduce a concept known, for example, from GitHub Actions (or other CI
> systems): an immutable cache. We are really talking about an immutable cache
> of "venvs" here, no more, no less, and a way to manage them.
> In CI systems such immutable caches are used for this very purpose;
> storing venvs is a basic use of them. And since those systems are
> already used in similar environments (multi-worker setups, rerunning stuff,
> etc.), it seems that this problem is already solved.
>
>    - CircleCI caching: https://circleci.com/docs/2.0/caching/
>    - GitHub Actions caching:
> https://docs.github.com/en/free-pro-team@latest/actions/guides/caching-dependencies-to-speed-up-workflows
>    - Travis CI caching: https://docs.travis-ci.com/user/caching/
>
> For this approach, managing venvs and other envs is largely a solved
> problem:
>
> 1) You define a way to determine the unique id (usually an md5 hash of the
> contributing files). In our case the hash might be based on
> "requirements.txt + airflow version + all provider versions" to account for
> seamless upgrades of Airflow and providers.
> 2) You have a central cache storage where you keep the cache.
> 3) The cache is immutable: once you create it, it stays untouched (though
> this allows some initial short-term duplication when several tasks create
> the same cache, but only one succeeds).
> 4) The only way to change the cache is to modify the sources (definition,
> requests etc.).
> 5) The cache is uploaded to a central location. This has the great benefit
> of optimising the speed of venv creation across multiple workers AND
> includes KubernetesExecutor as well!
> 6) This solution can then be applied to anything:
>      - caching .npm if someone wants to run their JavaScript
>      - preparing and caching machine learning models
>      - caching .maven deps for Java
>      .....
>
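
To make step 1) of the list above concrete, here is a minimal sketch of how
such a unique id could be computed. It is only an illustration: the function
name venv_cache_key and its parameters are made up, SHA-256 stands in for the
md5 mentioned above, and normalising and sorting the requirement lines is an
assumption so that cosmetic edits do not produce a new id.

```python
import hashlib


def venv_cache_key(requirements_text, airflow_version, provider_versions):
    """Return a deterministic hex id for a venv definition.

    `provider_versions` maps a provider package name to its version string.
    """
    digest = hashlib.sha256()

    # Normalise requirements: drop blank lines and comments, strip whitespace
    # and sort, so that reordering or commenting does not change the key.
    lines = sorted(
        line.strip()
        for line in requirements_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )
    digest.update("\n".join(lines).encode("utf-8"))

    # Mix in the Airflow version and every provider version, so that an
    # upgrade of either yields a new key and therefore a new cache entry.
    digest.update(b"\0airflow==" + airflow_version.encode("utf-8"))
    for name in sorted(provider_versions):
        digest.update(f"\0{name}=={provider_versions[name]}".encode("utf-8"))

    return digest.hexdigest()
```

Because the id changes whenever any input changes, an existing cache entry
never has to be updated in place; a changed definition simply maps to a
different entry.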
> I believe that while answering the questions above we will have to figure
> out something very similar to this pattern, so IMHO this is a great
> opportunity to "stand on the shoulders of giants": rather than
> reinventing the wheel, we should simply implement such an immutable cache
> and use it for venvs.
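
The "immutable, several may try but only one succeeds" behaviour could look
roughly like the sketch below. It rests on assumptions: get_or_create,
cache_root and build are hypothetical names, and a plain directory on one
filesystem stands in for the central cache storage. Note also that real
virtualenvs embed absolute paths, so building in a temporary directory and
renaming it would need extra care for actual venvs.

```python
import os
import shutil
import tempfile


def get_or_create(cache_root, key, build):
    """Return the directory for `key`, creating it via `build(path)` if absent.

    An entry is never modified once it exists under its final name.
    """
    final_path = os.path.join(cache_root, key)
    if os.path.isdir(final_path):
        return final_path  # cache hit

    os.makedirs(cache_root, exist_ok=True)
    # Build in a private directory on the same filesystem so that the final
    # rename is a single atomic step.
    tmp_path = tempfile.mkdtemp(prefix=key + ".tmp-", dir=cache_root)
    try:
        build(tmp_path)
        os.rename(tmp_path, final_path)
    except OSError:
        # Either the build hit an OS error or another task published the same
        # key first. Discard our copy; re-raise unless the entry now exists.
        shutil.rmtree(tmp_path, ignore_errors=True)
        if not os.path.isdir(final_path):
            raise
    except BaseException:
        # Any other failure: do not leave a half-built directory behind.
        shutil.rmtree(tmp_path, ignore_errors=True)
        raise
    return final_path
```

Two tasks racing on the same key may both run build, which is the
short-term duplication mentioned in 3), but only one rename wins and the
loser throws its copy away.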
>
>
> J.
>
>
>
> On Fri, Jan 8, 2021 at 6:20 PM Martineau, Constance <[email protected]>
> wrote:
>
>> This is a problem I have been grappling with for some time, and am
>> intrigued by your proposal! For departments that are under the same
>> "umbrella" department but are comprised of many different decentralized
>> smaller groups, the ability to see and interact with other people's dags
>> via the UI - while also managing their own dependencies - would be a
>> feature, not a bug. The Airflow UI, with descriptive DAG and task ids, is
>> a visual cue for the subgroups to speak with one another and leverage each
>> other's work as a base instead of reinventing the wheel.
>>
>> How would this work if one were using the KubernetesExecutor?
>>
>> Cheers,
>> Constance
>>
>> / CONSTANCE MARTINEAU  | Développeuse Principale, Platformes et
>> Exploitation | Tél: 514-847-7992 | [email protected]
>>
>> -----Original Message-----
>> From: Tomasz Urbaszek <[email protected]>
>> Sent: Friday, January 8, 2021 11:48 AM
>> To: [email protected]
>> Subject: Re: [DISCUSS] [AIP-37] Virtualenv management inside Airflow
>>
>> Thanks Zacharya! I think this is an interesting idea. It seems to be a
>> simple way to add "multi-tenancy" to Airflow. However, I'm afraid that
>> having separate venvs for separate tasks (=teams) solves only the
>> dependency management problem, not the problem of managing multiple teams
>> using the same cluster (different teams can still view, delete and update
>> DAGs of other teams, and they share the same db). As far as I know there's
>> no "production grade" way to deploy a single Airflow instance that can be
>> shared by teams, although I know there are ways to achieve that using a GKE
>> cluster with multiple instances of Airflow.
>>
>> Now, focusing on your proposal: if we decide that we want to
>> support such a feature, we will need to think about a retention policy for
>> the venvs. Having multiple venvs will increase disk/memory usage and we
>> should try to remove venvs that are no longer used.
>> In general we will need a mechanism to manage those venvs (define,
>> create, update, delete).
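
As an illustration of the retention policy mentioned above, a minimal sketch
could record when a venv was last used and periodically delete the stale
ones. All names here (touch_last_used, remove_stale_venvs, the .last_used
marker file) are hypothetical, and the sketch deliberately ignores the harder
question of venvs that are in use by a running task at cleanup time.

```python
import os
import shutil
import time


def touch_last_used(venv_path, now=None):
    """Record that the venv at `venv_path` was used (at `now`, default: now)."""
    marker = os.path.join(venv_path, ".last_used")
    with open(marker, "a"):
        pass  # create the marker file if it does not exist yet
    timestamp = time.time() if now is None else now
    os.utime(marker, (timestamp, timestamp))


def remove_stale_venvs(cache_root, max_age_seconds, now=None):
    """Delete venvs under `cache_root` unused for more than `max_age_seconds`.

    Returns the names of the removed directories.
    """
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(cache_root)):
        venv_path = os.path.join(cache_root, name)
        marker = os.path.join(venv_path, ".last_used")
        if not os.path.isfile(marker):
            continue  # not tracked by this sketch, leave it untouched
        if now - os.path.getmtime(marker) > max_age_seconds:
            shutil.rmtree(venv_path)
            removed.append(name)
    return removed
```

A task would call touch_last_used each time it activates a venv, and a
periodic job would call remove_stale_venvs with whatever age limit the
deployment chooses.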
>>
>> That said, I'm not convinced that Airflow should take care of its
>> environment/deployment. In my opinion it is the users' task to make sure
>> that their tasks are executed in the right environment. And I think this is
>> easily achievable with multiple workers/queues and Docker images that can
>> be built on CI/CD systems.
>>
>> Cheers,
>> Tomek
>>
>> On Fri, Jan 8, 2021 at 5:26 PM Zacharya Haitin <[email protected]>
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > Currently, there is no easy way to manage an Airflow cluster that
>> contains multiple teams and Python-based DAGs.
>> > I submitted the following AIP with a suggestion on how to make venv
>> management part of the Airflow executors' lifecycle and would love to get
>> your feedback.
>> >
>> > My suggestion should be pretty easy to implement and will help
>> users with Python package deployments.
>> >
>> > AIP:
>> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow
>> > Issue:
>> > https://github.com/apache/airflow/issues/13364
>> >
>> > Please let me know if you have any questions or suggestions on how to
>> improve this solution, or if you disagree with my approach.
>> >
>> > Thanks,
>> > Zacharya.
>>
>
>
> --
> +48 660 796 129
>
