Hi All,

I have a number of jobs that use scikit-learn for scoring players.
Occasionally I need to upgrade scikit-learn to take advantage of some new
features.  We have a single conda environment that specifies all the
dependencies for Airflow as well as for all of our DAGs.  So currently
upgrading scikit-learn means upgrading it for all DAGs that use it, and
retraining all models for that version.  It becomes a very involved task
and I'm hoping to find a better way.

One option is to use BashOperator (or something that wraps
BashOperator) and have the bash command activate a specific conda
environment with the right scikit-learn version.  It's simple, but I
don't like the idea of limiting task input to whatever can be passed
on the command line.  Still, an option.
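
Roughly what I have in mind, as a sketch (the env name and the script
path are just placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG("player_scoring", start_date=datetime(2019, 1, 1),
             schedule_interval="@daily") as dag:
        score = BashOperator(
            task_id="score_players",
            # activate a pinned conda env, then run the scoring script
            bash_command=(
                "source activate sklearn-0.20 && "
                "python /opt/jobs/score_players.py --run-date {{ ds }}"
            ),
        )

It works, but everything the task needs has to be squeezed into that
command string.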

Another option is the DockerOperator.  But when I asked around at a
previous Airflow Meetup, I couldn't find anyone actually using it.  It
also adds some complexity to the build and deploy process, since I'd
now have to build and maintain Docker images for all of my
environments.  Still, not ruling it out.
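
Something like this, I imagine (same DAG as above; the image name is
hypothetical):

    from airflow.operators.docker_operator import DockerOperator

    score = DockerOperator(
        task_id="score_players",
        # image baked with the desired scikit-learn version
        image="mycompany/player-scoring:sklearn-0.20",
        command="python /opt/jobs/score_players.py --run-date {{ ds }}",
        dag=dag,
    )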

And the last option I can think of is heterogeneous workers.  We are
migrating our Airflow infrastructure from EC2 to AWS ECS and plan to
support separate worker clusters, so this could include workers with
different conda environments.  I assume that as long as a few key
packages are identical between the scheduler and the workers (airflow,
redis, celery?), the rest can be whatever.
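
With the CeleryExecutor, I believe each task can be pinned to a queue
and each worker cluster would listen only on its own queue, something
like (queue name and callable are made up):

    from airflow.operators.python_operator import PythonOperator

    score = PythonOperator(
        task_id="score_players",
        python_callable=score_players,  # scoring function defined elsewhere
        queue="sklearn_0_20",  # only workers with that conda env consume it
        dag=dag,
    )

with the matching workers started via something like
"airflow worker -q sklearn_0_20".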

Has anyone faced this problem, and do you have any advice?  Am I
missing any simpler options?  Any thoughts would be much appreciated.

thanks,
Dennis
