As discussed earlier in the thread, the idea is intriguing and I think
it opens up easier automated management for different teams, or even
for writing isolated tasks. I have a few main concerns:
1) the general management (mentioned by Tomek). For the Celery and Local
Executors several workers can co-exist, and even if we "name" the virtualenv
as proposed in option b), they can easily overwrite each other when several
tasks run on the same worker machine.
There are many questions that need to be answered about the management of
the venvs, and IMHO whatever solution we agree on, the design will have to
document the behaviour for all of them. I am not saying that we need answers
to all of those now, but eventually the design should cover all of this:
- how do we update the envs when they change?
- how do we know an env has changed and needs an update?
- do we always try to update to the latest version? Is it
"eager-upgrade" or "upgrade-if-needed"?
- what do we do when other tasks use the same venv while we are
upgrading it? Should we wait until they finish? Or should we risk
inconsistencies/failures?
- do we always lock the requirements to specific versions, or allow for
version ranges?
- Which versions do we lock - only the dependencies in the definition of
the venv, or also transitive ones (pip freeze)?
- How do we cope when the version of Airflow is upgraded with newer
dependencies which conflict with existing venvs? Should we delete them?
- How do we cope if we only add one provider in Airflow which has
conflicting dependencies? This will be a far more frequent event, and for
sure one that will happen even without restarting Airflow, I believe.
- How do we deal with a situation where there is a conflict, or where a
transitive dependency introduces a new conflict and we cannot install the
venv cleanly (do we revert?)
- How do we deal with the situation where a transitive dependency causes
failures after installation?
- how do we keep track of which worker already has which version of the
venv?
- if/when do we delete those venvs?
There are more questions as well when we add multiple workers - should
the venv be identical on different workers, or should we allow for
differences? What happens when one task starts on one worker and then is
retried on a different one, yielding different results?
Finally - what happens if tasks from the past are re-executed (using
backfill)? Should they use the new version of the venv or the one they were
executed with previously? How do we keep track of it?
2) KubernetesExecutor:
- here the usefulness of optimisations is very limited IMHO if we just
"install" the .venv. It will work and there is obvious isolation (but it
is obviously slower).
I think the "naive" install-what-you-specify approach suffers from all of
that. However, I think we might approach it differently.
My proposal to consider:
Possibly a lot (if not all) of those problems can be solved if we introduce
a concept known, for example, from GitHub Actions (or other CI systems): an
immutable cache. We are really talking about an immutable cache of "venvs"
here - no more, no less - and a way to manage them.
In CI systems such immutable caches are used for this very purpose -
storing venvs is a basic utility of theirs. And since those systems are
already used in similar environments (multi-worker setups, rerunning stuff,
etc.), it seems that this problem is already solved.
- CircleCI caching: https://circleci.com/docs/2.0/caching/
- GA caching:
https://docs.github.com/en/free-pro-team@latest/actions/guides/caching-dependencies-to-speed-up-workflows
- Travis CI caching: https://docs.travis-ci.com/user/caching/
For this approach, managing venvs and other envs is largely a solved
problem:
1) you define a way to determine a unique id (usually an md5 hash of the
contributing files). In our case the hash might be based on "requirements.txt +
airflow version + all provider versions" to account for seamless upgrades
of airflow and providers (a rough sketch of this is included after this list).
2) you have a central cache storage where you keep the cache
3) the cache is immutable - once you create it, it stays untouched (though
there can be some initial short-term duplication when several tasks create
the same cache - but only one succeeds)
4) the only way to change the cache is to modify the sources (definition,
requirements, etc.)
5) the cache is uploaded to a central location - this has the great benefit
of optimising the speed of venv creation across multiple workers AND it
covers the KubernetesExecutor as well!
6) this solution can then be applied to anything:
- caching .npm if someone wants to run their JavaScript
- preparing and caching machine learning models
- caching .maven deps for Java
.....
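To make point 1) more concrete, here is a minimal sketch (Python) of how
the cache key and a "create only if missing" lookup could work. All names
here (compute_cache_key, get_or_create_venv, CACHE_ROOT) are hypothetical
and only illustrate the flow; syncing the resulting directory to and from a
central storage (points 2 and 5) would follow the same pattern:

import hashlib
import os
import shutil
import subprocess
from pathlib import Path

# Hypothetical local cache directory (sync with central storage not shown).
CACHE_ROOT = Path("/opt/airflow/venv-cache")

def compute_cache_key(requirements_file: str, airflow_version: str,
                      provider_versions: dict) -> str:
    """Hash of requirements + Airflow version + provider versions (point 1)."""
    digest = hashlib.md5()
    digest.update(Path(requirements_file).read_bytes())
    digest.update(airflow_version.encode())
    for name, version in sorted(provider_versions.items()):
        digest.update(f"{name}=={version}".encode())
    return digest.hexdigest()

def get_or_create_venv(requirements_file: str, airflow_version: str,
                       provider_versions: dict) -> Path:
    """Return the venv for the given key, creating it only if it is missing."""
    key = compute_cache_key(requirements_file, airflow_version, provider_versions)
    venv_dir = CACHE_ROOT / key
    if venv_dir.exists():
        return venv_dir  # immutable cache hit - reuse as-is (point 3)
    # Build into a temporary directory and publish it with an atomic rename,
    # so several tasks racing to create the same key cannot corrupt each other.
    tmp_dir = CACHE_ROOT / f"{key}.tmp-{os.getpid()}"
    subprocess.run(["python", "-m", "venv", str(tmp_dir)], check=True)
    subprocess.run([str(tmp_dir / "bin" / "pip"), "install",
                    "-r", requirements_file], check=True)
    try:
        tmp_dir.rename(venv_dir)  # only one creator "wins"
    except OSError:
        shutil.rmtree(tmp_dir)  # another task already published this key
    return venv_dir

On a cache miss a worker would first try to download the venv (or a tarball
of it) from the central storage before building it locally, and upload it
after a successful build - which is exactly what the CI caching systems
linked above do.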
I believe that while answering the questions above, we will have to figure
out something very similar to this pattern, so IMHO this is a great
opportunity to 'stand on the shoulders of giants' and, rather than
reinventing the wheel, we should simply implement such an immutable cache
and use it for the .venv.
J.
On Fri, Jan 8, 2021 at 6:20 PM Martineau, Constance <[email protected]>
wrote:
> This is a problem I have been grappling with for some time, and am
> intrigued by your proposal! For departments that are under the same
> "umbrella" department but are comprised of many different decentralized
> smaller groups, the ability to see and interact with other people's dags
> via the UI - while also managing their own dependencies - would be a
> feature, not a bug. The Airflow UI, with descriptive DAG and task ids, is
> a visual cue for the subgroups to speak with one another and leverage each
> other's work as a base instead of reinventing the wheel.
>
> How would this work if one were using the KubernetesExecutor?
>
> Cheers,
> Constance
>
> CONSTANCE MARTINEAU | Principal Developer, Platforms and
> Operations | Tel: 514-847-7992 | [email protected]
>
> -----Original Message-----
> From: Tomasz Urbaszek <[email protected]>
> Sent: Friday, January 8, 2021 11:48 AM
> To: [email protected]
> Subject: Re: [DISCUSS] [AIP-37] Virtualenv management inside Airflow
>
> Thanks Zacharya! I think this is an interesting idea. It seems to be a
> simple way to add "multi-tenancy" to Airflow. However, I'm afraid that
> having separate venvs for separate tasks (=teams) solves only the dependency
> management problem, not the problem of multiple teams using the same cluster
> (different teams can still view, delete, and update other teams' DAGs, and
> they share the same db). As far as I know there's no "production grade" way
> to deploy a single Airflow instance that can be shared by teams, although I
> know there are ways to achieve that using a GKE cluster with multiple
> instances of Airflow.
>
> Now, focusing on your proposition. If we decide that we want to
> support such a feature, we will need to think about a retention policy for
> the venvs. Having multiple venvs will increase disk/memory allocation and we
> should try to remove venvs that are no longer used.
> In general we will need a mechanism to manage those venvs (define, create,
> update, delete).
>
> That said, I'm not convinced that Airflow should take care of its
> environment/deployment. In my opinion it is the users' task to make sure
> that their tasks are executed in the right environment. And I think this is
> easily achievable with multiple workers/queues and Docker images that can
> be built on CI/CD systems.
>
> Cheers,
> Tomek
>
> On Fri, Jan 8, 2021 at 5:26 PM Zacharya Haitin <[email protected]>
> wrote:
> >
> > Hi everyone,
> >
> > Currently, there is no easy way to manage an Airflow cluster that
> contains multiple teams and Python-based DAGs.
> > I submitted the following AIP with a suggestion for how to make venv
> management part of the Airflow executors' lifecycle, and I would love to get
> your feedback.
> >
> > My suggestion should be pretty easy to implement and will help users
> with Python package deployments.
> >
> > AIP:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-37+Virtualenv+management+inside+Airflow
> > Issue:
> > https://github.com/apache/airflow/issues/13364
> >
> > Please let me know if you have any questions or suggestions on how to
> improve this solution, or if you disagree with my approach.
> >
> > Thanks,
> > Zacharya.
>
--
+48 660 796 129