I think Zhou's change is pretty much backwards-compatible with 1.10.x - it's basically an optimisation that people might find really useful until 2.0 is out. I believe (correct me if I am wrong) it does not require any change from the user's perspective. Airflow will continue to behave the same way as 1.10.4, only becoming more resilient in the case of huge DAG folders. It also seems battle-tested already, and I think it's a small-ish effort to continue reviewing it and add the few changes that are being discussed in the PR.
I think a viable approach, then, might be to ONLY implement it in the v1-10-test branch and make a 1.10.5 release much sooner than the end of the year. If we can get 1.10.5 out with only that change (and maybe a few other bug fixes) in a month or two from now, that would be beneficial for the users - they would get much better performance sooner, and they would have more time to prepare for the 2.0.0 backwards-incompatible migration later on. I guess it will take an additional few months for anyone migrating to 2.0.0 to convert their DAGs and do additional testing. There is a small risk (and a bigger divergence between 1.10.x and 2.0.0), but I think the risk is very small here - especially since I am sure Zhou and the Composer team will find, fix, and back-port any issues they come across in Composer (and this is one of the best testing grounds for Airflow ;)).

I volunteer to continue merging simple changes to the 1.10 branch if we go that direction (and to help with the 1.10.5 release when the time comes). I kind of got used to it already, and it's become a bit of a habit to cherry-pick those changes back to the v1-10-test branch.

J.

On Tue, Jul 30, 2019 at 8:28 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Thanks Kevin for the input.
>
> I am working on this full-time as well, with help from Ash.
>
> Persistence to the DB is what we want to achieve while making the Webserver stateless, hence mitigating the issues you described.
>
> My opinion is that we should work towards that goal and aim to include it in Airflow 2.0. However, I am not completely against a short-term solution that is completely optional, though that might increase the work.
>
> The diff between master and the current release branch is huge, so we should aim to cut 2.0 by the end of this year with the DAG Serialisation & persistence-in-DB feature.
>
> We can then have minor releases directly from master and follow SemVer as well.
>
> Regards,
> Kaxil
>
> On Tue, Jul 30, 2019, 09:02 Kevin Yang <yrql...@gmail.com> wrote:
>
> > For sure! I'll try my best to squeeze some time for it during the weekend and see how I can help facilitate the effort (don't get your hopes up too much though, I've got ~2 million things piled up in my TODO list :P). I will bring this up to the team and see if my team can help too.
> >
> > You guys have shown pretty solid understanding and skills, so I believe you can handle it well w/o me, but just in case you need me, don't hesitate to shoot me a direct mail.
> >
> > Cheers,
> > Kevin Y
> >
> > On Mon, Jul 29, 2019 at 8:19 PM Zhou Fang <zhouf...@google.com> wrote:
> >
> > > Hi Kevin, it makes sense. Thanks for the explanation! Hope we can get the DB persistence work to move faster.
> > >
> > > Zhou
> > >
> > > On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <yrql...@gmail.com> wrote:
> > >
> > >> oops, s/consistent file/consistent file order/
> > >>
> > >> On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <yrql...@gmail.com> wrote:
> > >>
> > >>> Hi Zhou,
> > >>>
> > >>> Totally understood, thank you for that. The streaming logic does cover most cases, though we still have worst cases where os.walk doesn't give us a consistent file order, and file/dir additions/renames cause a different result order from list_py_file_paths (e.g. right after we parsed the first dir, it was renamed and will be parsed last in the 2nd DAG loading, or we merged a new file right after the file paths were collected).
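(A minimal illustration of one way to make the listing deterministic - a hypothetical helper for this thread, not Airflow's actual list_py_file_paths - would be to sort the os.walk output:)

    import os

    def list_py_files(dag_folder):
        """Collect .py files under dag_folder in a deterministic order.

        Hypothetical sketch: sorting directories and files means two
        consecutive scans of an unchanged folder return the same order.
        """
        paths = []
        for root, dirs, files in os.walk(dag_folder):
            dirs.sort()  # os.walk (topdown) honors in-place edits to dirs
            for name in sorted(files):
                if name.endswith(".py"):
                    paths.append(os.path.join(root, name))
        return paths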
> > >>> Maybe there's a way to guarantee the order of parsing, but I'm not sure it's worth the effort, given that it is less of a problem if the end-to-end parsing time is small enough. I understand this may have started as a short-term improvement, but since it should not be too much more complicated, we'd rather start with the unified long-term pattern.
> > >>>
> > >>> Cheers,
> > >>> Kevin Y
> > >>>
> > >>> On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <zhouf...@google.com> wrote:
> > >>>
> > >>>> Hi Kevin,
> > >>>>
> > >>>> Yes. DAG persistence in DB is definitely the way to go. I referred to the async DAG loader because it may alleviate your current problem (since the code is ready).
> > >>>>
> > >>>> It actually reduces the time to 15 min, because DAGs are refreshed by the background process in a streaming way and you don't need to restart the webserver every 20 min.
> > >>>>
> > >>>> Thanks,
> > >>>> Zhou
> > >>>>
> > >>>> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <yrql...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi Zhou,
> > >>>>>
> > >>>>> Thank you for the pointer. This solves the issue where the gunicorn restart rate throttles the webserver refresh rate, but not the long DAG parsing time issue, right? Worst case we still wait 30 mins for the change to show up, compared to the previous 35 mins (I was wrong on the number, it should be 35 mins instead of 55 mins, as the clock starts whenever the webserver restarts). I believe in the previous discussion we first proposed this local webserver DAG parsing optimization to use the same DAG parsing logic as the scheduler to speed up the parsing. Then the stateless webserver proposal came up, and we bought into the idea that it is better to persist DAGs into the DB and read directly from the DB, for better DAG definition consistency and webserver cluster consistency. I'm all supportive of the proposed structure in AIP-24, but -1 on just feeding the webserver from a single subprocess parsing the DAGs. I would imagine there won't be too much additional work to fetch from the DB instead of a subprocess, would there? (I haven't looked into the serialization format part, but I am assuming they are the same/similar.)
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Kevin Y
> > >>>>>
> > >>>>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <zhouf...@google.com> wrote:
> > >>>>>
> > >>>>>> Hi Kevin,
> > >>>>>>
> > >>>>>> The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594
> > >>>>>>
> > >>>>>> The idea is that a background process parses the DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.
> > >>>>>>
> > >>>>>> We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs every [webserver] collect_dags_interval = 30s.
> > >>>>>>
> > >>>>>> If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.
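(A rough sketch of that streaming-refresh idea, with hypothetical names rather than the actual code in the PR: a background thread re-parses DAG files one by one, so the shared bag is refreshed gradually instead of all at once:)

    import threading
    import time

    COLLECT_DAGS_INTERVAL = 30  # seconds, cf. [webserver] collect_dags_interval
    dagbag = {}  # shared state: file path -> parsed DAGs (stand-in for a real DagBag)

    def collect_dags_forever(list_dag_files, parse_file):
        """Background loop: refresh the shared bag file by file, then sleep."""
        while True:
            for path in list_dag_files():
                dagbag[path] = parse_file(path)  # each file shows up as soon as it is parsed
            time.sleep(COLLECT_DAGS_INTERVAL)

    def start_collector(list_dag_files, parse_file):
        """Run the collector in a daemon thread next to the webserver worker."""
        t = threading.Thread(target=collect_dags_forever,
                             args=(list_dag_files, parse_file), daemon=True)
        t.start()
        return t

(In the PR described above, the parsed DAGs are handed to the gunicorn workers on the dagbag_sync_interval schedule; the sketch just shares a dict to keep it short.)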
> > >>>>>> Thanks,
> > >>>>>> Zhou
> > >>>>>>
> > >>>>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <yrql...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Nice job Zhou!
> > >>>>>>>
> > >>>>>>> Really excited - this is exactly what we wanted for the webserver scaling issue. I want to add another big driver that previously made Airbnb start thinking about this, to support the effort: it can not only bring consistency between webservers, but also consistency between the webserver and the scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocess) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 min delay between the scheduler starting to schedule things and users being able to see their deployed DAGs/changes...
> > >>>>>>>
> > >>>>>>> I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDag through DagFileProcessorManager running in a subprocess -- in the future we would feed it with data from the DB, which is likely slower (though the difference should have a negligible impact on scheduler performance). In fact, if we keep the existing behavior of trying to schedule only freshly parsed DAGs, then we may need to deal with a consistency issue -- the DAG processor and the scheduler race to update the flag indicating whether the DAG is newly parsed. No big deal there, just some thoughts off the top of my head that will hopefully be helpful.
> > >>>>>>>
> > >>>>>>> And good idea on pre-rendering the templates; I believe template rendering was the biggest concern in the previous discussion. We also chose the pre-rendering + JSON approach in our smart sensor API <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine -- a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it just in case you are interested -- maybe we can even share some logic.
> > >>>>>>>
> > >>>>>>> Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Kevin Y
> > >>>>>>>
> > >>>>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > >>>>>>>
> > >>>>>>> > Looks great Zhou,
> > >>>>>>> >
> > >>>>>>> > I have one thing that popped into my mind while reading the AIP: should we keep the caching at the webserver level? As the famous quote goes: *"There are only two hard things in Computer Science: cache invalidation and naming things."
> > >>>>>>> > -- Phil Karlton*
> > >>>>>>> >
> > >>>>>>> > Right now, the fundamental change being proposed in the AIP is fetching the DAGs from the database in a serialized format, and not parsing the Python files all the time. This will already give a great performance improvement on the webserver side because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:
> > >>>>>>> >
> > >>>>>>> > 1. No snappy UI, because it is still polling the database at a regular interval.
> > >>>>>>> > 2. Inconsistency between webservers, because they might poll at different intervals - I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4
> > >>>>>>> >
> > >>>>>>> > As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In this case we always have the latest version, and we tackle the second point above.
> > >>>>>>> >
> > >>>>>>> > To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify: https://pypi.org/project/inotify_simple/. This would change it from polling to an event-driven design, which is much more performant and less resource hungry. But this would be an AIP of its own.
> > >>>>>>> >
> > >>>>>>> > Again, great design and a comprehensive AIP, but I would include the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinion of others on this.
> > >>>>>>> >
> > >>>>>>> > Cheers, Fokko
> > >>>>>>> >
> > >>>>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang <zhouf...@google.com.invalid> wrote:
> > >>>>>>> >
> > >>>>>>> > > Hi Kaxil,
> > >>>>>>> > >
> > >>>>>>> > > Just sent out the AIP:
> > >>>>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > >>>>>>> > >
> > >>>>>>> > > Thanks!
> > >>>>>>> > > Zhou
> > >>>>>>> > >
> > >>>>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com> wrote:
> > >>>>>>> > >
> > >>>>>>> > > > Hi Kaxil,
> > >>>>>>> > > >
> > >>>>>>> > > > We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer. We aim to minimize the changes to the current Airflow code. Happy to get synced on this!
> > >>>>>>> > > >
> > >>>>>>> > > > Here is our progress:
> > >>>>>>> > > >
> > >>>>>>> > > > (1) Serializing DAGs using Pickle to be used in the webserver
> > >>>>>>> > > > It has been launched in Composer.
> > >>>>>>> > > > I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594
> > >>>>>>> > > > Currently it does not support non-Airflow operators, and we are working on a fix.
> > >>>>>>> > > >
> > >>>>>>> > > > (2) Caching pickled DAGs in the DB to be used by the webserver
> > >>>>>>> > > > We have a proof-of-concept implementation and are working on an AIP now.
> > >>>>>>> > > >
> > >>>>>>> > > > (3) Using JSON instead of Pickle in (1) and (2)
> > >>>>>>> > > > We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).
> > >>>>>>> > > >
> > >>>>>>> > > > I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design for (2) as an AIP next week. Glad to check together whether our implementation makes sense and to make improvements on it.
> > >>>>>>> > > >
> > >>>>>>> > > > Thanks!
> > >>>>>>> > > > Zhou
> > >>>>>>> > > >
> > >>>>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > >>>>>>> > > >
> > >>>>>>> > > >> Hi all,
> > >>>>>>> > > >>
> > >>>>>>> > > >> We, at Astronomer, are going to spend time working on DAG Serialisation. There are 2 AIPs that are somewhat related to what we plan to work on:
> > >>>>>>> > > >>
> > >>>>>>> > > >> - AIP-18 Persist all information from DAG file in DB <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > >>>>>>> > > >> - AIP-19 Making the webserver stateless <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > >>>>>>> > > >>
> > >>>>>>> > > >> We plan to use JSON as the Serialisation format and store it as a blob in the metadata DB.
> > >>>>>>> > > >>
> > >>>>>>> > > >> *Goals:*
> > >>>>>>> > > >>
> > >>>>>>> > > >> - Make the Webserver Stateless
> > >>>>>>> > > >> - Use the same version of the DAG across Webserver & Scheduler
> > >>>>>>> > > >> - Keep backward compatibility and have a flag (globally & at DAG level) to turn this feature on/off
> > >>>>>>> > > >> - Enable DAG Versioning (extended goal)
> > >>>>>>> > > >>
> > >>>>>>> > > >> We will be preparing a proposal (AIP) after some research and some initial work, and will open it for suggestions from the community.
> > >>>>>>> > > >>
> > >>>>>>> > > >> We already had some good brainstorming sessions with the Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) & Alex (from Uber), which will be a good starting point for us.
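(To make the JSON-as-a-blob idea above concrete, a minimal hypothetical sketch of what serializing a DAG could look like - the field selection here is illustrative only, not the format the AIP will settle on:)

    import json

    def dag_to_json(dag):
        """Serialize an illustrative subset of an airflow.models.DAG to a JSON blob.

        The returned string would be stored in a metadata-DB table and read
        back by the webserver instead of re-parsing the Python DAG file.
        """
        return json.dumps({
            "dag_id": dag.dag_id,
            "schedule_interval": str(dag.schedule_interval),
            "tasks": [
                {
                    "task_id": task.task_id,
                    "task_type": type(task).__name__,
                    "upstream_task_ids": sorted(task.upstream_task_ids),
                }
                for task in dag.tasks
            ],
        })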
> > >>>>>>> > > >> If anyone in the community is interested in this or has some experience with the same and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
> > >>>>>>> > > >>
> > >>>>>>> > > >> Regards,
> > >>>>>>> > > >> Kaxil

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129