Re: Airflow and Machine Learning

Soma S Dhavala Wed, 19 Feb 2020 20:23:36 -0800

The daggit design doc
<https://docs.google.com/document/d/153n7aj9P7bn1-EqqJP3ChqZ8xVJ4JKUCHZMnXTe-qZY/edit?usp=sharing>
outlines the vision of what we were looking for in terms of an
ML-As-A-Service platform.
Some ideas on making the apps composable are here
<https://docs.google.com/document/d/1-l0EZveZJAxcRNdTOvNzVArq6V37Aj24H0CGd76Nh80/edit?usp=sharing>
and there is a deck
<https://docs.google.com/presentation/d/1M6rD8AYWauC6MvHZqTyjmK24NX-9cVevlIMeDgRASmw/edit?usp=sharing>
on use cases.


On Thu, Feb 20, 2020 at 9:43 AM Soma S Dhavala <soma.dhav...@gmail.com>
wrote:

> At project sunbird, we built daggit
> <https://github.com/project-sunbird/sunbird-ml-workbench>, an open
> source ML-As-A-Service platform on top of Airflow. While Airflow and
> other ML platforms have taken a *code-as-configuration* approach, we
> like to have users declaratively specify their ML Apps via YAML/JSON.
> We have to parse those ML App specs and programmatically write the
> DAGs that Airflow can understand.
>
> Pain points: programmatically creating DAGs is a drag. Some specific
> keywords have to be present in the auto-generated DAG file; otherwise
> the DagBag won't be filled. Not sure whether something has changed with
> Airflow > 1.9.
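
A minimal sketch of the generation step described above, assuming a JSON app spec with hypothetical fields (`dag_id`, `schedule`, `tasks`). The spec format and helper names are illustrative only and are not daggit's actual schema. The sketch renders the text of a DAG file and then applies a check modeled on the DagBag safe-mode heuristic, which, as far as I know, skips any Python file that does not contain both of the strings "DAG" and "airflow". That heuristic is a likely source of the "specific keywords" issue, though the source does not confirm it.

```python
import json

# Hypothetical app spec; field names are illustrative, not daggit's schema.
SPEC_JSON = """
{
  "dag_id": "train_model",
  "schedule": "@daily",
  "tasks": [
    {"id": "extract", "command": "echo extract", "upstream": []},
    {"id": "train", "command": "echo train", "upstream": ["extract"]}
  ]
}
"""

# Import path below is the Airflow 1.x location of BashOperator.
HEADER = '''from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id={dag_id!r},
    schedule_interval={schedule!r},
    start_date=datetime(2020, 1, 1),
)
'''


def render_dag_file(spec):
    """Render the text of an Airflow DAG file from a parsed spec.

    Assumes every task id is a valid Python identifier.
    """
    lines = [HEADER.format(dag_id=spec["dag_id"], schedule=spec["schedule"])]
    for task in spec["tasks"]:
        lines.append(
            "{id} = BashOperator(task_id={id!r}, bash_command={cmd!r}, dag=dag)".format(
                id=task["id"], cmd=task["command"]
            )
        )
    # Emit dependencies after all tasks are defined.
    for task in spec["tasks"]:
        for parent in task["upstream"]:
            lines.append("{} >> {}".format(parent, task["id"]))
    return "\n".join(lines) + "\n"


def passes_safe_mode(source):
    """Mimic the safe-mode heuristic: both strings must appear in the file."""
    return "DAG" in source and "airflow" in source


spec = json.loads(SPEC_JSON)
dag_source = render_dag_file(spec)
# Syntax check only; the generated file is compiled, not imported or run.
compile(dag_source, "train_model.py", "exec")
```

Writing `dag_source` into the DAGs folder would be the remaining step. If a generator builds DAGs indirectly (for example through a factory imported from another module), the file may need the two strings added explicitly, for instance in a comment.
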
>
> On Thu, Feb 20, 2020 at 8:42 AM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Thank you everyone for this feedback! I will organize these (and other)
>> ideas and look forward to the conversation it starts!
>>
>> On Wed, Feb 19, 2020 at 9:54 AM, Ben Tallman <btall...@gmail.com> wrote:
>> I don’t really have time to unpack a lot here, but we use Airflow
>> extensively to orchestrate Databricks notebook-based jobs. To date, we
>> haven’t really exposed the notebook visualizations in the Airflow UI, but
>> instead provide deep links to the job output.
>>
>> We spent a significant amount of time building handlers into our
>> operators that take convention-based XCom data and pass it from job to job
>> through the pipeline. In many cases these aren’t ML jobs, but they
>> are notebook-style pipelines, and we use XCom in this way to break the jobs
>> up between notebooks.
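
One way such a convention could look, as a rough sketch: every job pushes a dict under an agreed XCom key, and the next job merges what its upstream tasks pushed. The key name `notebook_output` and both helper functions are hypothetical and not taken from the setup described above. `FakeTaskInstance` is an in-memory stand-in so the helpers can be tried without Airflow; a real task instance exposes `xcom_push` and `xcom_pull` with these keyword arguments.

```python
def collect_upstream_params(ti, upstream_task_ids, key="notebook_output"):
    """Merge the dicts that upstream tasks pushed under the agreed XCom key."""
    params = {}
    for task_id in upstream_task_ids:
        value = ti.xcom_pull(task_ids=task_id, key=key)
        if isinstance(value, dict):  # ignore tasks that pushed nothing
            params.update(value)
    return params


def publish_outputs(ti, outputs, key="notebook_output"):
    """Push this job's outputs under the same key for downstream tasks."""
    ti.xcom_push(key=key, value=outputs)


class FakeTaskInstance:
    """In-memory stand-in for an Airflow task instance (illustration only)."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self.store = store  # shared dict: (task_id, key) -> value

    def xcom_push(self, key, value):
        self.store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids=None, key=None):
        return self.store.get((task_ids, key))


store = {}
publish_outputs(FakeTaskInstance("prepare", store), {"table": "features_v1"})
params = collect_upstream_params(FakeTaskInstance("train", store), ["prepare"])
```

Here `params` ends up holding what `prepare` published, ready to be handed to the next notebook job as its parameters.
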
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Tallman
>> Chief Technology Officer
>>
>> M Science LLC
>> 101 SW Main Street, Suite 350
>> Portland, OR 97204
>> 503-433-1552 (o/m)
>> btall...@mscience.com
>> mscience.com <https://mscience.com>
>> ________________________________
>> From: Maxime Beauchemin <maximebeauche...@gmail.com>
>> Sent: Wednesday, February 19, 2020 9:30:30 AM
>> To: dev@airflow.apache.org <dev@airflow.apache.org>
>> Subject: Re: Airflow and Machine Learning
>>
>> I have a lot of thoughts to unpack here, but top of mind is a deeper
>> integration with [jupyter] notebooks and/or hosted notebook-type systems.
>> Notebooks [with papermill <https://github.com/nteract/papermill>] can be
>> parameterized predictably, and notebook files provide rich log outputs
>> (organized by cells, can show data samples, charts, ...). For many ML
>> practitioners, it seems like a system that can execute and orchestrate
>> notebooks is a large chunk of what they need.
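
A small sketch of parameterized notebook execution with papermill, assuming one executed copy of the notebook is kept per run so that its rendered cells can serve as the run's log. The helper names and the output naming scheme are made up for illustration; `papermill.execute_notebook(input_path, output_path, parameters=...)` is the library call doing the work.

```python
import os


def notebook_run_paths(notebook_path, run_id, output_dir):
    """Return (input, output) paths so each run keeps its own executed copy."""
    base = os.path.splitext(os.path.basename(notebook_path))[0]
    output_path = os.path.join(output_dir, "{}__{}.ipynb".format(base, run_id))
    return notebook_path, output_path


def run_notebook(notebook_path, run_id, output_dir, parameters):
    """Execute a parameterized notebook and return the executed copy's path."""
    # Imported lazily so the path helper above works without papermill installed.
    import papermill as pm

    input_path, output_path = notebook_run_paths(notebook_path, run_id, output_dir)
    pm.execute_notebook(input_path, output_path, parameters=parameters)
    return output_path
```

A call such as `run_notebook("train.ipynb", "2020-02-19", "/tmp/runs", {"learning_rate": 0.01})` could be wrapped in a task; papermill injects the parameters after the notebook cell tagged `parameters`.
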
>>
>> Maybe a special [deeply integrated] notebook operator that can 1)
>> bootstrap a specified docker image, 2) visualize ipynb in place of logs
>> in the Airflow UI. On top of that maybe an Airflow plugin that enables
>> people to execute or schedule notebooks without crafting a DAG, though
>> there's probably a need for control mechanisms to be in place in that case.
>>
>> Max
>>
>> On Wed, Feb 19, 2020 at 8:47 AM Dan Davydov <ddavy...@twitter.com.invalid>
>> wrote:
>>
>> > Twitter uses Airflow primarily for ML, to create automated pipelines for
>> > retraining data, but also for more ad-hoc training jobs.
>> >
>> > The biggest gaps are on the experimentation side. It takes too long for
>> > a new user to set up and run a pipeline and then iterate on it. This
>> > problem is more specific to ML than to other domains because 1) training
>> > jobs can take a very long time to run, and 2) users need to launch
>> > multiple experiments in parallel for the same model pipeline.
>> >
>> > Biggest Gaps:
>> > - Too much boilerplate to write DAGs compared to Dagster/etc, and
>> > difficulty in message passing (XCom). There was a proposal recently to
>> > improve this in Airflow which should be entering AIP soon.
>> > - Lack of pipeline isolation, which hurts model experimentation (being
>> > able to run a DAG, modify it, and run it again without affecting the
>> > previous run). Lack of isolation of DAGs from Airflow infrastructure
>> > (inability to redeploy Airflow infra without also redeploying DAGs)
>> > also hurts.
>> > - Lack of multi-tenancy; it's hard for customers to quickly launch an
>> > ad-hoc pipeline, and the overhead of setting up a cluster and all of its
>> > dependencies is quite high.
>> > - Lack of integration with data visualization plugins (e.g. plugins for
>> > rendering data related to a task when you click a task instance in the
>> > UI).
>> > - Lack of simpler abstractions for users with limited knowledge of
>> > Airflow or even Python to build simple pipelines (not really an Airflow
>> > problem, but rather the need for a good abstraction that sits on top of
>> > Airflow, like a drag-and-drop pipeline builder).
>> >
>> > FWIW my personal feeling is that a fair number of companies in the ML
>> > space are moving to alternative solutions like TFX Pipelines because of
>> > the focus these platforms have on ML (ML pipelines are first-class
>> > citizens) and the support from Google. It would be great if we could
>> > change that. The ML orchestration/tooling space is evolving very rapidly,
>> > and there are promising new entrants as well.
>> >
>> > On Wed, Feb 19, 2020 at 10:56 AM Germain Tanguy
>> > <germain.tan...@dailymotion.com.invalid> wrote:
>> >
>> > > Hello Daniel,
>> > >
>> > > At my company we use Airflow to update our ML models and to run
>> > > predictions.
>> > >
>> > > As we use kubernetesOperator to trigger jobs, all ML DAGs are similar,
>> > > and ML/data science engineers can reuse a template and choose which
>> > > type of machine they need (highcpu, highmem, GPU or not, etc.).
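
As a rough illustration of the template idea above: a single function can turn a machine-type name into the settings for a Kubernetes-launched task, so each ML DAG differs only in a few arguments. The profile values, the function name, and the shape of the returned dict are all hypothetical and not taken from the setup described above. Only the `nvidia.com/gpu` resource name is a standard Kubernetes convention.

```python
# Hypothetical machine profiles; the CPU and memory values are placeholders.
MACHINE_PROFILES = {
    "highcpu": {"cpu": "16", "memory": "16Gi"},
    "highmem": {"cpu": "4", "memory": "64Gi"},
}


def pod_task_config(task_id, image, machine_type, gpus=0):
    """Build the settings for one pod-based task from a machine-type name."""
    if machine_type not in MACHINE_PROFILES:
        raise ValueError("unknown machine type: {}".format(machine_type))
    # Copy so that adding a GPU never mutates the shared profile.
    resources = dict(MACHINE_PROFILES[machine_type])
    if gpus:
        resources["nvidia.com/gpu"] = str(gpus)
    return {
        "task_id": task_id,
        "name": task_id.replace("_", "-"),  # pod names cannot contain underscores
        "image": image,
        "resources": resources,
    }


config = pod_task_config(
    "train_model", "registry.example/train:latest", "highmem", gpus=1
)
```

The resulting dict would then be mapped onto whatever arguments the operator in use expects.
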
>> > >
>> > > We have a process in place, described in the second part of this
>> > > article (Industrializing machine learning pipeline):
>> > >
>> > > https://medium.com/dailymotion/collaboration-between-data-engineers-data-analysts-and-data-scientists-97c00ab1211f
>> > >
>> > > Hope this helps.
>> > >
>> > > Germain.
>> > >
>> > > On 19/02/2020 16:42, "Daniel Imberman" <daniel.imber...@gmail.com>
>> > > wrote:
>> > >
>> > > Hello everyone!
>> > >
>> > > I’m working on a few proposals to make Apache Airflow more friendly
>> > > for ML/Data science use-cases, and I wanted to reach out in hopes of
>> > > hearing from people that are using/wish to use Airflow for ML. If you
>> > > have any opinions on the subject, I’d love to hear what you’re all
>> > > working on!
>> > >
>> > > Current questions I’m looking into:
>> > >
>> > > 1. How do you use Airflow for your ML? Has it worked out well for
>> > > you?
>> > > 2. Are there any features that would improve your experience of
>> > > building models on Airflow?
>> > > 3. Have you built anything on top of/around Airflow to aid
>> > > you in this process?
>> > >
>> > > Thank you so much for your time!
>> > >
>> > >
>> > >
>> >
>
>
