Hi All,

I have a number of jobs that use scikit-learn for scoring players. Occasionally I need to upgrade scikit-learn to take advantage of new features. We have a single conda environment that specifies all the dependencies for Airflow as well as for all of our DAGs. So currently, upgrading scikit-learn means upgrading it for every DAG that uses it and retraining all models against the new version. That's a very involved task, and I'm hoping to find a better way.
One option is to use BashOperator (or something that wraps BashOperator) and have bash run the task under a specific conda environment with the scikit-learn version that DAG needs. While simple, I don't like that it limits task input to the command line. Still, an option.

Another option is the DockerOperator. But when I asked around at a previous Airflow Meetup, I couldn't find anyone actually using it. It also adds complexity to the build and deploy process, since I'd now have to build and maintain Docker images for all of my environments. Still, not ruling it out.

And the last option I can think of is heterogeneous workers. We are migrating our Airflow infrastructure from EC2 to AWS ECS and plan to support separate worker clusters, which could include workers with different conda environments. My assumption is that as long as a few key packages (airflow, celery, redis?) are identical between the scheduler and worker instances, the rest can be whatever.

Has anyone faced this problem and have some advice? Am I missing any simpler options? I've put rough sketches of what I mean by each option below my signature. Any thoughts much appreciated.

thanks,
Dennis
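P.S. Sketches follow, to make the options concrete. For option 1, a minimal sketch of a BashOperator task that calls the target conda env's interpreter directly (the env name, paths, and script are hypothetical placeholders; imports assume Airflow 1.x):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG(dag_id="score_players",
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        score = BashOperator(
            task_id="score",
            # Invoke the env's own python binary, which sidesteps
            # "conda activate" problems in non-interactive shells. Note that
            # task inputs are limited to argv and env vars here, which is
            # exactly my complaint about this option.
            bash_command=(
                "/opt/conda/envs/sklearn-0.21/bin/python "
                "/opt/jobs/score_players.py --model-dir /models/sklearn-0.21"
            ),
        )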
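For option 2, the same task as a DockerOperator, with one image per pinned environment (image name/tag is hypothetical; DockerOperator may need the docker extra installed):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.docker_operator import DockerOperator

    with DAG(dag_id="score_players_docker",
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        score = DockerOperator(
            task_id="score",
            # Image with the pinned scikit-learn version baked in.
            image="mycompany/scoring:sklearn-0.21",
            command=("python /opt/jobs/score_players.py "
                     "--model-dir /models/sklearn-0.21"),
            docker_url="unix://var/run/docker.sock",  # the default, shown for clarity
            auto_remove=True,  # clean up the container when the task finishes
        )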
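And for option 3, my understanding is that Celery queues are the routing mechanism: each worker cluster subscribes to a queue (e.g. started with `airflow worker -q sklearn-0.21`), and tasks set `queue` to land on workers whose conda env pins the right version. A sketch, with a hypothetical queue name:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def score_players(**context):
        # Runs with whatever scikit-learn the worker's conda env provides.
        import sklearn
        print("scoring with scikit-learn", sklearn.__version__)

    with DAG(dag_id="score_players_queued",
             start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        score = PythonOperator(
            task_id="score",
            python_callable=score_players,
            provide_context=True,
            # Route to the worker cluster with the matching conda env.
            queue="sklearn-0.21",
        )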