As long as the differences are in API methods and not a rearrangement of the
package structure, the latter option would work. This is because the operators
are only imported by the scheduler, not executed there, so the version-specific
operator methods would never actually be called on the scheduler.

If you serialize the parameters into a JSON string, you can simplify how data
is passed on the command line and cut down the number of arguments you'd have
to pass.
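
For example, a minimal sketch (the conda environment name, script path and
parameter values are made up, and it assumes a 'dag' object is already
defined in the DAG file):

import json

from airflow.operators.bash_operator import BashOperator

params = {"model_version": "v3", "n_estimators": 100}

score_players = BashOperator(
    task_id="score_players",
    # Activate the task-specific conda env, then hand the whole parameter
    # set to the script as a single JSON string argument.
    bash_command=(
        "source activate sklearn_0_19 && "
        "python /opt/jobs/score_players.py --params '{{ params.params_json }}'"
    ),
    params={"params_json": json.dumps(params)},
    dag=dag,
)

The script then only has to json.loads() one argument instead of parsing a
growing list of flags.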

You could also look into the 'queue' parameter on a task, which forces the
task instance to run on a worker listening to that queue. Have you seen that?
Then you don't need to maintain all the different conda environments on every
worker, and you can use APIs to spin those specific workers up and down ahead
of time.
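
As a sketch (the queue name is just an example and has to match whatever you
start the worker with):

from airflow.operators.bash_operator import BashOperator

score_players = BashOperator(
    task_id="score_players",
    bash_command="source activate sklearn_0_19 && python /opt/jobs/score_players.py",
    # CeleryExecutor routes this task only to workers listening on this queue
    queue="sklearn_0_19",
    dag=dag,
)

Then only the workers you start with "airflow worker -q sklearn_0_19" need
that conda environment; the scheduler just imports the DAG file and never
runs the task itself.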

Rgds,

G>


On Tue, Jan 30, 2018 at 6:13 PM, Dennis O'Brien <den...@dennisobrien.net>
wrote:

> Hi All,
>
> I have a number of jobs that use scikit-learn for scoring players.
> Occasionally I need to upgrade scikit-learn to take advantage of some new
> features.  We have a single conda environment that specifies all the
> dependencies for Airflow as well as for all of our DAGs.  So currently
> upgrading scikit-learn means upgrading it for all DAGs that use it, and
> retraining all models for that version.  It becomes a very involved task
> and I'm hoping to find a better way.
>
> One option is to use BashOperator (or something that wraps BashOperator)
> and have bash use a specific conda environment with that version of
> scikit-learn.  While simple, I don't like the idea of limiting task input
> to the command line.  Still, an option.
>
> Another option is the DockerOperator.  But when I asked around at a
> previous Airflow Meetup, I couldn't find anyone actually using it.  It also
> adds some complexity to the build and deploy process in that now I have to
> maintain docker images for all my environments.  Still, not ruling it out.
>
> And the last option I can think of is just heterogeneous workers.  We are
> migrating our Airflow infrastructure to AWS ECS (from EC2) and plan on
> having support for separate worker clusters, so this could include workers
> with different conda environments.  I assume as long as a few key packages
> are identical between scheduler and worker instances (airflow, redis,
> celery?) the rest can be whatever.
>
> Has anyone faced this problem and have some advice?  Am I missing any
> simpler options?  Any thoughts much appreciated.
>
> thanks,
> Dennis
>
