Thanks Kevin for the input. I am working on this full-time as well with help from Ash.
Persistence to DB is what we want to achieve while making the Webserver
stateless, hence mitigating the issues you described. My opinion is that we
should work towards that goal and aim to include it in Airflow 2.0.
However, I am not completely against a short-term solution that is
completely optional, though that might increase the work. The diff between
master and the current release branch is huge, so we should aim to cut 2.0
by the end of this year with the DAG Serialisation & persistence-in-DB
feature. We can then have minor releases directly from master and follow
SemVer as well.

Regards,
Kaxil

On Tue, Jul 30, 2019, 09:02 Kevin Yang <yrql...@gmail.com> wrote:

> For sure! I'll try my best to squeeze some time for it during the weekend
> and see how I can help facilitate the effort (don't get the hopes up too
> much tho, got ~2 million things piled in my TODO list :P). Will bring
> this up to the team and see if my team can help too.
>
> You guys have shown pretty solid understanding and skills, so I believe
> you can handle it well w/o me, but just in case you need me, don't
> hesitate to shoot me a direct mail.
>
> Cheers,
> Kevin Y
>
> On Mon, Jul 29, 2019 at 8:19 PM Zhou Fang <zhouf...@google.com> wrote:
>
> > Hi Kevin, it makes sense. Thanks for the explanation! Hope we can get
> > DB persistence moving faster.
> >
> > Zhou
> >
> > On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <yrql...@gmail.com> wrote:
> >
> >> oops, s/consistent file/consistent file order/
> >>
> >> On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <yrql...@gmail.com> wrote:
> >>
> >>> Hi Zhou,
> >>>
> >>> Totally understood, thank you for that. The streaming logic does
> >>> cover most cases, tho we still have worst cases where os.walk doesn't
> >>> give us a consistent file order and file/dir additions/renames cause
> >>> a different result order from list_py_file_paths (e.g.
> >>> right after we parsed the first dir, it was renamed and will be
> >>> parsed last in the 2nd DAG loading, or we merged a new file right
> >>> after the file paths were collected). Maybe there's a way to
> >>> guarantee the order of parsing, but I'm not sure it's worth the
> >>> effort, given that it is less of a problem if the end-to-end parsing
> >>> time is small enough. I understand it may have started as a
> >>> short-term improvement, but since it should not be too much more
> >>> complicated, we'd rather start with the unified long-term pattern.
> >>>
> >>> Cheers,
> >>> Kevin Y
> >>>
> >>> On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <zhouf...@google.com> wrote:
> >>>
> >>>> Hi Kevin,
> >>>>
> >>>> Yes. DAG persistence in DB is definitely the way to go. I referred
> >>>> to the async DAG loader because it may alleviate your current
> >>>> problem (since it is code-ready).
> >>>>
> >>>> It actually reduces the time to 15 min, because DAGs are refreshed
> >>>> by the background process in a streaming way and you don't need to
> >>>> restart the webserver per 20 min.
> >>>>
> >>>> Thanks,
> >>>> Zhou
> >>>>
> >>>> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <yrql...@gmail.com> wrote:
> >>>>
> >>>>> Hi Zhou,
> >>>>>
> >>>>> Thank you for the pointer. This solves the issue that the gunicorn
> >>>>> restart rate throttles the webserver refresh rate, but not the
> >>>>> long DAG parsing time issue, right? Worst-case scenario, we still
> >>>>> wait 30 mins for a change to show up, compared to the previous 35
> >>>>> mins (I was wrong on the number; it should be 35 mins instead of
> >>>>> 55 mins, as the clock starts whenever the webserver restarts). I
> >>>>> believe in the previous discussion we first proposed this local
> >>>>> webserver DAG parsing optimization to use the same DAG parsing
> >>>>> logic as the scheduler to speed up the parsing.
> >>>>> Then the stateless webserver proposal came up and we were brought
> >>>>> around to the idea that it is better to persist DAGs into the DB
> >>>>> and read directly from the DB: better DAG definition consistency
> >>>>> and webserver cluster consistency. I'm all supportive of the
> >>>>> proposed structure in AIP-24, but -1 on just feeding the webserver
> >>>>> with a single subprocess parsing the DAGs. I would imagine there
> >>>>> won't be too much additional work to fetch from the DB instead of
> >>>>> a subprocess, would there? (haven't looked into the serialization
> >>>>> format part, but assuming they are the same/similar)
> >>>>>
> >>>>> Cheers,
> >>>>> Kevin Y
> >>>>>
> >>>>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <zhouf...@google.com> wrote:
> >>>>>
> >>>>>> Hi Kevin,
> >>>>>>
> >>>>>> The problem that DAG parsing takes a long time can be solved by
> >>>>>> asynchronous DAG loading:
> >>>>>> https://github.com/apache/airflow/pull/5594
> >>>>>>
> >>>>>> The idea is that a background process parses DAG files and sends
> >>>>>> DAGs to the webserver process every [webserver]
> >>>>>> dagbag_sync_interval = 10s.
> >>>>>>
> >>>>>> We have launched it in Composer, so our users can set the
> >>>>>> webserver worker restart interval to 1 hour (or longer). The
> >>>>>> background DAG parsing process refreshes all DAGs per [webserver]
> >>>>>> collect_dags_interval = 30s.
> >>>>>>
> >>>>>> If parsing all DAGs takes 15 min, you can see DAGs being
> >>>>>> gradually refreshed with this feature.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Zhou
> >>>>>>
> >>>>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <yrql...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Nice job Zhou!
> >>>>>>>
> >>>>>>> Really excited, exactly what we wanted for the webserver scaling
> >>>>>>> issue.
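[For readers following along: Zhou's asynchronous loading scheme above — a
background collector re-parses the DAGs folder on a fixed interval and
publishes fresh results to the webserver — can be sketched roughly as
follows. This is an illustrative toy, not the PR's code: the PR uses a
subprocess, while this sketch uses a thread, and `DAG_FILES`/`fake_parse`
are invented stand-ins for a real DAGs folder and parser. Only the config
name `collect_dags_interval` comes from the thread.]

```python
import threading
import time

# Stand-in for a DAGs folder: file path -> dag_id it defines.
DAG_FILES = {"dag_a.py": "dag_a", "dag_b.py": "dag_b"}

def fake_parse(path, name):
    # Stand-in for actually importing/parsing a DAG file.
    return {"dag_id": name, "source": path}

class AsyncDagLoader:
    """Background collector that periodically rebuilds a DagBag snapshot."""

    def __init__(self, collect_dags_interval=0.1):
        self.collect_dags_interval = collect_dags_interval
        self._dagbag = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def _collect_loop(self):
        while not self._stop.is_set():
            # Re-"parse" everything, then atomically swap in the snapshot,
            # so readers never observe a half-built DagBag.
            fresh = {name: fake_parse(path, name)
                     for path, name in DAG_FILES.items()}
            with self._lock:
                self._dagbag = fresh
            self._stop.wait(self.collect_dags_interval)

    def start(self):
        threading.Thread(target=self._collect_loop, daemon=True).start()

    def get_dagbag(self):
        # Webserver side: read the latest snapshot, never parse files.
        with self._lock:
            return dict(self._dagbag)

    def stop(self):
        self._stop.set()

loader = AsyncDagLoader()
loader.start()
time.sleep(0.3)  # give the collector a couple of cycles
print(sorted(loader.get_dagbag()))  # ['dag_a', 'dag_b']
loader.stop()
```

The point of the snapshot swap is the "streaming" behavior discussed above:
the webserver always serves the last complete parse while the next one
proceeds in the background.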
> >>>>>>> Want to add another big driver for Airbnb to have started
> >>>>>>> thinking about this previously, to support the effort: it can
> >>>>>>> not only bring consistency between webservers but also
> >>>>>>> consistency between the webserver and scheduler/workers. It may
> >>>>>>> be less of a problem if the total DAG parsing time is small, but
> >>>>>>> for us the total DAG parsing time is 15+ mins and we had to set
> >>>>>>> the webserver (gunicorn subprocesses) restart interval to 20
> >>>>>>> mins, which leads to a worst-case 15+20+15=50 min delay between
> >>>>>>> when the scheduler starts to schedule things and when users can
> >>>>>>> see their deployed DAGs/changes...
> >>>>>>>
> >>>>>>> I'm not so sure about the scheduler performance improvement:
> >>>>>>> currently we already feed the main scheduler process with
> >>>>>>> SimpleDag through a DagFileProcessorManager running in a
> >>>>>>> subprocess--in the future we'd feed it with data from the DB,
> >>>>>>> which is likely slower (tho the diff should have negligible
> >>>>>>> impact on scheduler performance). In fact, if we keep the
> >>>>>>> existing behavior of trying to schedule only freshly parsed
> >>>>>>> DAGs, then we may need to deal with a consistency issue--the DAG
> >>>>>>> processor and the scheduler race to update the flag indicating
> >>>>>>> whether the DAG is newly parsed. No big deal there, but just
> >>>>>>> some thoughts off the top of my head that hopefully can be
> >>>>>>> helpful.
> >>>>>>>
> >>>>>>> And good idea on pre-rendering the template; I believe template
> >>>>>>> rendering was the biggest concern in the previous discussion.
> >>>>>>> We've also chosen the pre-rendering+JSON approach in our smart
> >>>>>>> sensor API
> >>>>>>> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
> >>>>>>> and it seems to be working fine--a supporting case for your
> >>>>>>> proposal ;) There's a WIP PR
> >>>>>>> <https://github.com/apache/airflow/pull/5499> for it just in
> >>>>>>> case you are interested--maybe we can even share some logic.
> >>>>>>>
> >>>>>>> Thumbs-up again for this, and please don't hesitate to reach out
> >>>>>>> if you want to discuss further with us or need any help from us.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Kevin Y
> >>>>>>>
> >>>>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko
> >>>>>>> <fo...@driesprong.frl> wrote:
> >>>>>>>
> >>>>>>> > Looks great Zhou,
> >>>>>>> >
> >>>>>>> > I have one thing that popped into my mind while reading the
> >>>>>>> > AIP: whether we should keep the caching at the webserver
> >>>>>>> > level. As the famous quote goes: *"There are only two hard
> >>>>>>> > things in Computer Science: cache invalidation and naming
> >>>>>>> > things." -- Phil Karlton*
> >>>>>>> >
> >>>>>>> > Right now, the fundamental change that is being proposed in
> >>>>>>> > the AIP is fetching the DAGs from the database in a serialized
> >>>>>>> > format, and not parsing the Python files all the time. This
> >>>>>>> > will already give a great performance improvement on the
> >>>>>>> > webserver side because it removes a lot of the processing.
> >>>>>>> > However, since we're still fetching the DAGs from the database
> >>>>>>> > at a regular interval and caching them in the local process,
> >>>>>>> > we still have the two issues that Airflow is suffering from
> >>>>>>> > right now:
> >>>>>>> >
> >>>>>>> > 1. No snappy UI, because it is still polling the database at a
> >>>>>>> > regular interval.
> >>>>>>> > 2.
> >>>>>>> > Inconsistency between webservers, because they might poll at
> >>>>>>> > a different interval. I think we've all seen this:
> >>>>>>> > https://www.youtube.com/watch?v=sNrBruPS3r4
> >>>>>>> >
> >>>>>>> > As I also mentioned in the Slack channel, I strongly feel that
> >>>>>>> > we should be able to render most views from the tables in the
> >>>>>>> > database, without touching the blob. For specific views, we
> >>>>>>> > could just pull the blob from the database. In this case we
> >>>>>>> > always have the latest version, and we tackle the second point
> >>>>>>> > above.
> >>>>>>> >
> >>>>>>> > To tackle the first one, I also have an idea. We should change
> >>>>>>> > the DAG parser from a loop to something that uses inotify
> >>>>>>> > (https://pypi.org/project/inotify_simple/). This will change
> >>>>>>> > it from polling to an event-driven design, which is much more
> >>>>>>> > performant and less resource-hungry. But this would be an AIP
> >>>>>>> > on its own.
> >>>>>>> >
> >>>>>>> > Again, great design and a comprehensive AIP, but I would
> >>>>>>> > include the caching on the webserver to greatly improve the
> >>>>>>> > user experience in the UI. Looking forward to the opinion of
> >>>>>>> > others on this.
> >>>>>>> >
> >>>>>>> > Cheers, Fokko
> >>>>>>> >
> >>>>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang
> >>>>>>> > <zhouf...@google.com.invalid> wrote:
> >>>>>>> >
> >>>>>>> > > Hi Kaxil,
> >>>>>>> > >
> >>>>>>> > > Just sent out the AIP:
> >>>>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> >>>>>>> > >
> >>>>>>> > > Thanks!
> >>>>>>> > > Zhou
> >>>>>>> > >
> >>>>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang
> >>>>>>> > > <zhouf...@google.com> wrote:
> >>>>>>> > >
> >>>>>>> > > > Hi Kaxil,
> >>>>>>> > > >
> >>>>>>> > > > We are also working on persisting DAGs into the DB using
> >>>>>>> > > > JSON for the Airflow webserver in Google Composer. We aim
> >>>>>>> > > > to minimize the change to the current Airflow code. Happy
> >>>>>>> > > > to get synced on this!
> >>>>>>> > > >
> >>>>>>> > > > Here is our progress:
> >>>>>>> > > > (1) Serializing DAGs using Pickle to be used in the webserver
> >>>>>>> > > > It has been launched in Composer. I am working on the PR
> >>>>>>> > > > to upstream it:
> >>>>>>> > > > https://github.com/apache/airflow/pull/5594
> >>>>>>> > > > Currently it does not support non-Airflow operators, and
> >>>>>>> > > > we are working on a fix.
> >>>>>>> > > >
> >>>>>>> > > > (2) Caching pickled DAGs in the DB to be used by the webserver
> >>>>>>> > > > We have a proof-of-concept implementation and are working
> >>>>>>> > > > on an AIP now.
> >>>>>>> > > >
> >>>>>>> > > > (3) Using JSON instead of Pickle in (1) and (2)
> >>>>>>> > > > Decided to use JSON because Pickle is neither secure nor
> >>>>>>> > > > human-readable. The serialization approach is very similar
> >>>>>>> > > > to (1).
> >>>>>>> > > >
> >>>>>>> > > > I will update the PR
> >>>>>>> > > > (https://github.com/apache/airflow/pull/5594) to replace
> >>>>>>> > > > Pickle with JSON, and send our design for (2) as an AIP
> >>>>>>> > > > next week. Glad to check together whether our
> >>>>>>> > > > implementation makes sense and make improvements on that.
> >>>>>>> > > >
> >>>>>>> > > > Thanks!
> >>>>>>> > > > Zhou
> >>>>>>> > > >
> >>>>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik
> >>>>>>> > > > <kaxiln...@gmail.com> wrote:
> >>>>>>> > > >
> >>>>>>> > > >> Hi all,
> >>>>>>> > > >>
> >>>>>>> > > >> We, at Astronomer, are going to spend time working on DAG
> >>>>>>> > > >> Serialisation. There are 2 AIPs that are somewhat related
> >>>>>>> > > >> to what we plan to work on:
> >>>>>>> > > >>
> >>>>>>> > > >> - AIP-18 Persist all information from DAG file in DB
> >>>>>>> > > >>   <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> >>>>>>> > > >> - AIP-19 Making the webserver stateless
> >>>>>>> > > >>   <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> >>>>>>> > > >>
> >>>>>>> > > >> We plan to use JSON as the serialisation format and store
> >>>>>>> > > >> it as a blob in the metadata DB.
> >>>>>>> > > >>
> >>>>>>> > > >> *Goals:*
> >>>>>>> > > >>
> >>>>>>> > > >> - Make the Webserver stateless
> >>>>>>> > > >> - Use the same version of the DAG across Webserver &
> >>>>>>> > > >>   Scheduler
> >>>>>>> > > >> - Keep backward compatibility and have a flag (globally &
> >>>>>>> > > >>   at DAG level) to turn this feature on/off
> >>>>>>> > > >> - Enable DAG versioning (extended goal)
> >>>>>>> > > >>
> >>>>>>> > > >> We will be preparing a proposal (AIP) after some research
> >>>>>>> > > >> and some initial work, and will open it up for
> >>>>>>> > > >> suggestions from the community.
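[For readers following along: the "JSON blob in the metadata DB" plan that
keeps coming up in this thread can be made concrete with a minimal
write/read sketch. The table and column names below (`serialized_dag`,
`dag_id`, `data`) are illustrative, not Airflow's actual schema, and a real
serializer would also have to handle operators, callables, timedeltas, and
so on.]

```python
import json
import sqlite3

def serialize_dag(dag):
    # Keep only JSON-friendly fields; a real serializer must encode
    # operators, schedules, defaults, etc.
    return json.dumps(dag, sort_keys=True)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data BLOB)")

dag = {"dag_id": "example", "schedule_interval": "@daily",
       "tasks": [{"task_id": "start"}, {"task_id": "end"}]}

# Writer side (scheduler / DAG processor): persist the JSON blob.
conn.execute(
    "INSERT OR REPLACE INTO serialized_dag VALUES (?, ?)",
    (dag["dag_id"], serialize_dag(dag).encode()),
)

# Reader side (stateless webserver): fetch and deserialize the blob,
# never touching the Python DAG files on disk.
(blob,) = conn.execute(
    "SELECT data FROM serialized_dag WHERE dag_id = ?", ("example",)
).fetchone()
restored = json.loads(blob.decode())
print(restored["tasks"][0]["task_id"])  # start
```

Because every webserver (and optionally the scheduler) reads the same row,
all of them see an identical version of the DAG, which is the consistency
argument made earlier in the thread.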
> >>>>>>> > > >> We already had some good brainstorming sessions with
> >>>>>>> > > >> Twitter folks (DanD & Sumit), folks from GoDataDriven
> >>>>>>> > > >> (Fokko & Bas) & Alex (from Uber), which will be a good
> >>>>>>> > > >> starting point for us.
> >>>>>>> > > >>
> >>>>>>> > > >> If anyone in the community is interested or has
> >>>>>>> > > >> experience with the same and wants to collaborate, please
> >>>>>>> > > >> let me know and join the #dag-serialisation channel on
> >>>>>>> > > >> Airflow Slack.
> >>>>>>> > > >>
> >>>>>>> > > >> Regards,
> >>>>>>> > > >> Kaxil
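[Editor's footnote on Zhou's point (3) earlier in the thread — choosing
JSON because Pickle is neither secure nor human-readable: the toy example
below shows why. Unpickling untrusted bytes can execute code embedded in
the payload, while JSON parsing can only ever build plain data. The names
`record`, `Malicious`, and `EVIDENCE` are invented for the demonstration.]

```python
import json
import pickle

EVIDENCE = []

def record(msg):
    # Stand-in for arbitrary attacker code (think os.system, etc.).
    EVIDENCE.append(msg)
    return msg

class Malicious:
    # __reduce__ tells pickle to call record(...) at *load* time, so
    # merely deserializing this payload executes code.
    def __reduce__(self):
        return (record, ("code ran during unpickling",))

payload = pickle.dumps(Malicious())
pickle.loads(payload)  # the side effect fires here
print(EVIDENCE)  # ['code ran during unpickling']

# json.loads has no such hook: it parses data, never executes it, and
# the payload stays human-readable for debugging.
doc = json.dumps({"dag_id": "example", "tasks": ["start", "end"]})
print(json.loads(doc)["tasks"])  # ['start', 'end']
```

This is the standard argument for JSON over Pickle whenever the stored
blob could be written by one component and read by another, as in the
webserver/scheduler split discussed in this thread.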