Yep. My thought exactly regarding the in-memory cache.

On Fri, Jan 8, 2021 at 8:00 PM Tomasz Urbaszek <[email protected]> wrote:

> I like the idea of an immutable cache. I would even be tempted to combine it
> with some form of load balancing in the case of Celery-like executors to
> optimise performance. And that could also be extended to support not only
> venvs but also an in-memory cache (for example, for ML models), as you mentioned.
>
> T.
>
> On Fri, Jan 8, 2021 at 7:19 PM Jarek Potiuk <[email protected]> wrote:
>
>> As discussed before: the idea is intriguing, and I think it opens up easier
>> automated management for different teams, or even for writing isolated
>> tasks. I have a few main concerns:
>>
>> 1) General management (mentioned by Tomek). For the Celery and Local
>> Executors, several workers can co-exist, and even if we "name" the virtualenv as
>> proposed in option b), they can easily overwrite each other when several
>> tasks run on the same worker machine.
>>
>> There are many questions that need to be answered about managing the
>> venvs, and IMHO if we agree on a solution, the design will have to
>> document the behaviour for all of them. I am not saying that we need
>> answers to all of those now, but eventually the design should cover them:
>>     - How do we update the envs when they change?
>>     - How do we know an env has changed and needs an update?
>>     - Do we always try to update to the latest version? Is it
>> "eager-upgrade" or "upgrade-if-needed"?
>>     - What do we do when other tasks use the same venv while we are
>> upgrading it? Should we wait until they finish? Or should we risk
>> inconsistencies/failures?
>>     - Do we always lock the requirements to specific versions, or do we
>> allow ranges of versions?
>>     - Which versions do we lock - only the dependencies in the definition of
>> the venv, or also the transitive ones (pip freeze)?
>>     - How do we cope when Airflow is upgraded with newer
>> dependencies which conflict with existing venvs? Should we delete them?
>>     - How do we cope if we only add one provider to Airflow which has
>> conflicting dependencies? This will be a far more frequent event, and I believe
>> one that will happen even without restarting Airflow.
>>     - How do we deal with a situation where there is a conflict, or where a
>> transitive dependency introduces a new conflict and we cannot install the venv
>> cleanly (do we revert?)?
>>     - How do we deal with a situation where a transitive dependency causes
>> failures after installation?
>>     - How do we keep track of which worker already has which version of the
>> venv?
>>     - If/when do we delete those venvs?
>>
>> There are more questions as well once we add multiple workers - should
>> the venv be identical on different workers, or should we allow for
>> differences? What happens when one task starts on one worker and is then
>> retried on a different one, yielding different results?
>> Finally - what happens if tasks from the past are re-executed (using
>> backfill)? Should they use the new version of the venv or the one they were
>> executed with previously? How do we keep track of that?
>>
>> 2) KubernetesExecutor:
>> - here the usefulness of optimisations is very limited IMHO if we just
>> "install" the venv. It will work and there is obvious isolation, but it
>> is obviously slower.
>>
>> I think the "naive" install-what-you-specify approach suffers from all of
>> that. However, I think we might approach it differently.
>>
>> My proposal to consider:
>>
>>
>>
>> Possibly a lot (if not all) of those problems can be solved if we
>> introduce a concept known, for example, from GitHub Actions (and other CI
>> systems): an immutable cache. We are really talking about an immutable cache
>> of venvs here - no more, no less - and a way to manage them.
>> In CI systems such immutable caches are used for this very purpose -
>> storing venvs is a basic use case for them. And since those systems are
>> already used in similar environments (multi-worker setups, re-running stuff,
>> etc.), it seems that this problem is already solved:
>>
>>    - CircleCI caching: https://circleci.com/docs/2.0/caching/
>>    - GitHub Actions caching:
>> https://docs.github.com/en/free-pro-team@latest/actions/guides/caching-dependencies-to-speed-up-workflows
>>    - Travis CI caching: https://docs.travis-ci.com/user/caching/
>>
>> With this approach, managing venvs and other environments is largely a solved
>> problem:
>>
>> 1) You define a way to determine a unique id (usually an md5 hash of the
>> contributing files). In our case the hash might be based on "requirements.txt +
>> Airflow version + all provider versions" to account for seamless upgrades
>> of Airflow and providers (a rough sketch follows this list).
>> 2) You have central cache storage where you keep the caches.
>> 3) The cache is immutable - once you create it, it stays untouched (though
>> there may be some initial short-term duplication when several tasks create the
>> same cache - only one succeeds).
>> 4) The only way to change the cache is to modify its sources (the definition,
>> the requirements, etc.).
>> 5) The cache is uploaded to a central location - this has the great benefit
>> of optimising the speed of venv creation across multiple workers AND it covers
>> the KubernetesExecutor as well!
>> 6) This solution can then be applied to anything:
>>      - caching .npm if someone wants to run their JavaScript
>>      - preparing and caching machine learning models
>>      - caching Maven deps for Java
>>      ...
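>>
>> To make 1) and 2) concrete, here is a minimal sketch of how such a cache key
>> and lookup could work (the storage helpers, names, and exact hash inputs are
>> just assumptions for illustration, not a proposed API):
>>
>> import hashlib
>>
>> def venv_cache_key(requirements_path, airflow_version, provider_versions):
>>     """Derive an immutable cache id from everything that defines the venv."""
>>     digest = hashlib.md5()
>>     with open(requirements_path, "rb") as f:
>>         digest.update(f.read())
>>     digest.update(airflow_version.encode())
>>     for name, version in sorted(provider_versions.items()):
>>         digest.update(f"{name}=={version}".encode())
>>     return digest.hexdigest()
>>
>> # Hypothetical worker-side flow (central storage could be S3/GCS/NFS):
>> # key = venv_cache_key("requirements.txt", "2.0.0",
>> #                      {"apache-airflow-providers-google": "2.0.0"})
>> # if not storage.exists(f"venvs/{key}.tar.gz"):
>> #     build_venv(); storage.upload(f"venvs/{key}.tar.gz")   # first writer wins
>> # storage.download_and_unpack(f"venvs/{key}.tar.gz")        # others just reuse it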
>>
>> I believe that while answering the questions above we will have to figure out
>> something very similar to this pattern, so IMHO this is a great
>> opportunity to 'stand on the shoulders of giants': rather than
>> reinventing the wheel, we should simply implement such an immutable cache and
>> use it for the venvs.
>>
>>
>> J.
>>
>>
>>
>> On Fri, Jan 8, 2021 at 6:20 PM Martineau, Constance <[email protected]>
>> wrote:
>>
>>> This is a problem I have been grappling with for some time, and I am
>>> intrigued by your proposal! For departments that are under the same
>>> "umbrella" department but are composed of many different decentralized
>>> smaller groups, the ability to see and interact with other people's DAGs
>>> via the UI - while also managing their own dependencies - would be a
>>> feature, not a bug. The Airflow UI, with descriptive DAG and task ids, is
>>> a visual cue for the subgroups to speak with one another and leverage each
>>> other's work as a base instead of reinventing the wheel.
>>>
>>> How would this work if one were using the KubernetesExecutor?
>>>
>>> Cheers,
>>> Constance
>>>
>>> / CONSTANCE MARTINEAU  | Principal Developer, Platforms and
>>> Operations | Tel: 514-847-7992 | [email protected]
>>>
>>> -----Original Message-----
>>> From: Tomasz Urbaszek <[email protected]>
>>> Sent: Friday, January 8, 2021 11:48 AM
>>> To: [email protected]
>>> Subject: Re: [DISCUSS] [AIP-37] Virtualenv management inside Airflow
>>>
>>> Thanks Zacharya! I think this is an interesting idea. It seems to be a
>>> simple way to add "multi-tenancy" to Airflow. However, I'm afraid that
>>> having separate venvs for separate tasks (= teams) solves only the dependency
>>> management problem, not the problem of managing multiple teams using the same
>>> cluster (different teams can still view, delete, and update other teams'
>>> DAGs, and they share the same DB). As far as I know, there's no
>>> "production-grade" way to deploy a single Airflow instance that can be shared
>>> by teams, although I know there are ways to achieve that using a GKE cluster
>>> with multiple instances of Airflow.
>>>
>>> Now, focusing on your proposal: if we decide that we want to
>>> support such a feature, we will need to think about a retention policy for
>>> the venvs. Having multiple venvs will increase disk/memory usage, and we
>>> should try to remove venvs that are no longer used.
>>> In general, we will need a mechanism to manage those venvs (define,
>>> create, update, delete).
>>>
>>> That said, I'm not convinced that Airflow should take care of its own
>>> environment/deployment. In my opinion, it is the users' job to make sure
>>> that their tasks are executed in the right environment, and I think this is
>>> easily achievable with multiple workers/queues and Docker images that can
>>> be built on CI/CD systems.
>>>
>>> Cheers,
>>> Tomek
>>>
>>> On Fri, Jan 8, 2021 at 5:26 PM Zacharya Haitin <[email protected]>
>>> wrote:
>>> >
>>> > Hi everyone,
>>> >
>>> > Currently, there is no easy way to manage an Airflow cluster that
>>> contains multiple teams and Python-based DAGs.
>>> > I submitted the following AIP with a suggestion for how to make venv
>>> management part of the Airflow executors' lifecycle, and I would love to get
>>> your feedback.
>>> >
>>> > My suggestion should be pretty easy to implement and will help
>>> users with Python package deployments.
>>> >
>>> > AIP: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow
>>> > Issue: https://github.com/apache/airflow/issues/13364
>>> >
>>> > Please let me know if you have any questions or suggestions on how to
>>> improve this solution, or if you disagree with my approach.
>>> >
>>> > Thanks,
>>> > Zacharya.
>>>
>>>
>>
>>
>> --
>> +48 660 796 129
>>
>

-- 
+48 660 796 129
