I think Zhou's change is pretty much backwards-compatible with 1.10.x - it's basically an optimisation that people might find really useful until 2.0 is out. I believe (correct me if I am wrong) it does not require any change from the user's perspective. Airflow will continue to behave the same way as 1.10.4, only becoming more resilient in the case of huge DAG folders. It also seems battle-tested already, and I think it's a small-ish effort to continue reviewing it and add the few changes that are being discussed in the PR.
I think a viable approach, then, might be to ONLY implement it in the v1-10-test branch and make a 1.10.5 release much sooner than the end of the year. If we can get 1.10.5 out with only that change (and maybe a few other bug fixes) in a month or two from now, that would be beneficial for the users - they would get much better performance sooner, and they would have more time to prepare for the 2.0.0 backwards-incompatible migration later on. I guess it will take an additional few months for anyone migrating to 2.0.0 to convert their DAGs and do additional testing. There is a small risk (and a bigger divergence between 1.10.x and 2.0.0), but I think the risk is very small here - especially since I am sure Zhou and the Composer team will find, fix, and back-port any issues they come across in Composer (and this is one of the best testing grounds for Airflow ;)).

I volunteer to continue merging simple changes to the 1.10 branch if we go that direction (and to help with the 1.10.5 release when the time comes). I kind of got used to it already, and it's become a bit of a habit to cherry-pick those changes back to the v1-10-test branch.

J.

On Tue, Jul 30, 2019 at 8:28 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Thanks Kevin for the input.
>
> I am working on this full-time as well, with help from Ash.
>
> Persistence to the DB is what we want to achieve while making the Webserver stateless, hence mitigating the issues you described.
>
> My opinion is that we should work towards that goal and aim to include it in Airflow 2.0. However, I am not completely against a short-term solution that is completely optional, though that might increase the work.
>
> The diff between master and the current release branch is huge, so we should aim to cut 2.0 by the end of this year with the DAG Serialisation & persistence-in-DB feature.
>
> We can then have minor releases directly from master and follow SemVer as well.
>
> Regards,
> Kaxil
>
> On Tue, Jul 30, 2019, 09:02 Kevin Yang <yrql...@gmail.com> wrote:
>
> > For sure! I'll try my best to squeeze some time for it during the weekend and see how I can help facilitate the effort (don't get your hopes up too much though, I've got ~2 million things piled up in my TODO list :P). I will bring this up to the team and see if my team can help too.
> >
> > You guys have shown pretty solid understanding and skills, so I believe you can handle it well w/o me, but just in case you need me, don't hesitate to shoot me a direct mail.
> >
> > Cheers,
> > Kevin Y
> >
> > On Mon, Jul 29, 2019 at 8:19 PM Zhou Fang <zhouf...@google.com> wrote:
> >
> > > Hi Kevin, it makes sense. Thanks for the explanation! Hope we can get the DB persistence work to move faster.
> > >
> > > Zhou
> > >
> > > On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <yrql...@gmail.com> wrote:
> > >
> > >> oops, s/consistent file/consistent file order/
> > >>
> > >> On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <yrql...@gmail.com> wrote:
> > >>
> > >>> Hi Zhou,
> > >>>
> > >>> Totally understood, thank you for that. The streaming logic does cover most cases, though we still have worst cases where os.walk doesn't give us a consistent file order, and file/dir additions/renames cause a different result order from list_py_file_paths (e.g. right after we parsed the first dir, it was renamed and will be parsed last in the 2nd DAG loading, or we merged a new file right after the file paths were collected).
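(A minimal illustration of one way to make the listing deterministic - a hypothetical helper for this thread, not Airflow's actual list_py_file_paths - would be to sort the os.walk output:)

    import os

    def list_py_files(dag_folder):
        """Collect .py files under dag_folder in a deterministic order.

        Hypothetical sketch: sorting directories and files means two
        consecutive scans of an unchanged folder return the same order.
        """
        paths = []
        for root, dirs, files in os.walk(dag_folder):
            dirs.sort()  # os.walk (topdown) honors in-place edits to dirs
            for name in sorted(files):
                if name.endswith(".py"):
                    paths.append(os.path.join(root, name))
        return paths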
> > >>> Maybe there's a way to guarantee the order of parsing, but I'm not sure it's worth the effort, given that it is less of a problem if the end-to-end parsing time is small enough. I understand this may have started as a short-term improvement, but since it should not be too much more complicated, we'd rather start with the unified long-term pattern.
> > >>>
> > >>> Cheers,
> > >>> Kevin Y
> > >>>
> > >>> On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <zhouf...@google.com> wrote:
> > >>>
> > >>>> Hi Kevin,
> > >>>>
> > >>>> Yes. DAG persistence in DB is definitely the way to go. I referred to the async DAG loader because it may alleviate your current problem (since the code is ready).
> > >>>>
> > >>>> It actually reduces the time to 15 min, because DAGs are refreshed by the background process in a streaming way and you don't need to restart the webserver every 20 min.
> > >>>>
> > >>>> Thanks,
> > >>>> Zhou
> > >>>>
> > >>>> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <yrql...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi Zhou,
> > >>>>>
> > >>>>> Thank you for the pointer. This solves the issue where the gunicorn restart rate throttles the webserver refresh rate, but not the long DAG parsing time issue, right? Worst case we still wait 30 mins for the change to show up, compared to the previous 35 mins (I was wrong on the number, it should be 35 mins instead of 55 mins, as the clock starts whenever the webserver restarts). I believe in the previous discussion we first proposed this local webserver DAG parsing optimization to use the same DAG parsing logic as the scheduler to speed up the parsing. Then the stateless webserver proposal came up, and we bought into the idea that it is better to persist DAGs into the DB and read directly from the DB, for better DAG definition consistency and webserver cluster consistency. I'm all supportive of the proposed structure in AIP-24, but -1 on just feeding the webserver from a single subprocess parsing the DAGs. I would imagine there won't be too much additional work to fetch from the DB instead of a subprocess, would there? (I haven't looked into the serialization format part, but I am assuming they are the same/similar.)
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Kevin Y
> > >>>>>
> > >>>>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <zhouf...@google.com> wrote:
> > >>>>>
> > >>>>>> Hi Kevin,
> > >>>>>>
> > >>>>>> The problem that DAG parsing takes a long time can be solved by asynchronous DAG loading: https://github.com/apache/airflow/pull/5594
> > >>>>>>
> > >>>>>> The idea is that a background process parses the DAG files and sends DAGs to the webserver process every [webserver] dagbag_sync_interval = 10s.
> > >>>>>>
> > >>>>>> We have launched it in Composer, so our users can set the webserver worker restart interval to 1 hour (or longer). The background DAG parsing process refreshes all DAGs every [webserver] collect_dags_interval = 30s.
> > >>>>>>
> > >>>>>> If parsing all DAGs takes 15 min, you can see DAGs being gradually refreshed with this feature.
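(A rough sketch of that streaming-refresh idea, with hypothetical names rather than the actual code in the PR: a background thread re-parses DAG files one by one, so the shared bag is refreshed gradually instead of all at once:)

    import threading
    import time

    COLLECT_DAGS_INTERVAL = 30  # seconds, cf. [webserver] collect_dags_interval
    dagbag = {}  # shared state: file path -> parsed DAGs (stand-in for a real DagBag)

    def collect_dags_forever(list_dag_files, parse_file):
        """Background loop: refresh the shared bag file by file, then sleep."""
        while True:
            for path in list_dag_files():
                dagbag[path] = parse_file(path)  # each file shows up as soon as it is parsed
            time.sleep(COLLECT_DAGS_INTERVAL)

    def start_collector(list_dag_files, parse_file):
        """Run the collector in a daemon thread next to the webserver worker."""
        t = threading.Thread(target=collect_dags_forever,
                             args=(list_dag_files, parse_file), daemon=True)
        t.start()
        return t

(In the PR described above, the parsed DAGs are handed to the gunicorn workers on the dagbag_sync_interval schedule; the sketch just shares a dict to keep it short.)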
> > >>>>>> Thanks,
> > >>>>>> Zhou
> > >>>>>>
> > >>>>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <yrql...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Nice job Zhou!
> > >>>>>>>
> > >>>>>>> Really excited - this is exactly what we wanted for the webserver scaling issue. I want to add another big driver that previously made Airbnb start thinking about this, to support the effort: it can not only bring consistency between webservers, but also consistency between the webserver and the scheduler/workers. It may be less of a problem if the total DAG parsing time is small, but for us the total DAG parsing time is 15+ mins and we had to set the webserver (gunicorn subprocess) restart interval to 20 mins, which leads to a worst-case 15+20+15=50 min delay between the scheduler starting to schedule things and users being able to see their deployed DAGs/changes...
> > >>>>>>>
> > >>>>>>> I'm not so sure about the scheduler performance improvement: currently we already feed the main scheduler process with SimpleDag through DagFileProcessorManager running in a subprocess -- in the future we would feed it with data from the DB, which is likely slower (though the difference should have a negligible impact on scheduler performance). In fact, if we keep the existing behavior of trying to schedule only freshly parsed DAGs, then we may need to deal with a consistency issue -- the DAG processor and the scheduler race to update the flag indicating whether the DAG is newly parsed. No big deal there, just some thoughts off the top of my head that will hopefully be helpful.
> > >>>>>>>
> > >>>>>>> And good idea on pre-rendering the templates; I believe template rendering was the biggest concern in the previous discussion. We also chose the pre-rendering + JSON approach in our smart sensor API <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization> and it seems to be working fine -- a supporting case for your proposal ;) There's a WIP PR <https://github.com/apache/airflow/pull/5499> for it just in case you are interested -- maybe we can even share some logic.
> > >>>>>>>
> > >>>>>>> Thumbs-up again for this, and please don't hesitate to reach out if you want to discuss further with us or need any help from us.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Kevin Y
> > >>>>>>>
> > >>>>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > >>>>>>>
> > >>>>>>> > Looks great Zhou,
> > >>>>>>> >
> > >>>>>>> > I have one thing that popped into my mind while reading the AIP: should we keep the caching at the webserver level? As the famous quote goes: *"There are only two hard things in Computer Science: cache invalidation and naming things."
> > >>>>>>> > -- Phil Karlton*
> > >>>>>>> >
> > >>>>>>> > Right now, the fundamental change being proposed in the AIP is fetching the DAGs from the database in a serialized format, and not parsing the Python files all the time. This will already give a great performance improvement on the webserver side because it removes a lot of the processing. However, since we're still fetching the DAGs from the database at a regular interval and caching them in the local process, we still have the two issues that Airflow is suffering from right now:
> > >>>>>>> >
> > >>>>>>> > 1. No snappy UI, because it is still polling the database at a regular interval.
> > >>>>>>> > 2. Inconsistency between webservers, because they might poll at different intervals - I think we've all seen this: https://www.youtube.com/watch?v=sNrBruPS3r4
> > >>>>>>> >
> > >>>>>>> > As I also mentioned in the Slack channel, I strongly feel that we should be able to render most views from the tables in the database, so without touching the blob. For specific views, we could just pull the blob from the database. In this case we always have the latest version, and we tackle the second point above.
> > >>>>>>> >
> > >>>>>>> > To tackle the first one, I also have an idea. We should change the DAG parser from a loop to something that uses inotify: https://pypi.org/project/inotify_simple/. This would change it from polling to an event-driven design, which is much more performant and less resource hungry. But this would be an AIP of its own.
> > >>>>>>> >
> > >>>>>>> > Again, great design and a comprehensive AIP, but I would include the caching on the webserver to greatly improve the user experience in the UI. Looking forward to the opinion of others on this.
> > >>>>>>> >
> > >>>>>>> > Cheers, Fokko
> > >>>>>>> >
> > >>>>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang <zhouf...@google.com.invalid> wrote:
> > >>>>>>> >
> > >>>>>>> > > Hi Kaxil,
> > >>>>>>> > >
> > >>>>>>> > > Just sent out the AIP:
> > >>>>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> > >>>>>>> > >
> > >>>>>>> > > Thanks!
> > >>>>>>> > > Zhou
> > >>>>>>> > >
> > >>>>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com> wrote:
> > >>>>>>> > >
> > >>>>>>> > > > Hi Kaxil,
> > >>>>>>> > > >
> > >>>>>>> > > > We are also working on persisting DAGs into the DB using JSON for the Airflow webserver in Google Composer. We aim to minimize the changes to the current Airflow code. Happy to get synced on this!
> > >>>>>>> > > >
> > >>>>>>> > > > Here is our progress:
> > >>>>>>> > > >
> > >>>>>>> > > > (1) Serializing DAGs using Pickle to be used in the webserver
> > >>>>>>> > > > It has been launched in Composer.
> > >>>>>>> > > > I am working on the PR to upstream it: https://github.com/apache/airflow/pull/5594
> > >>>>>>> > > > Currently it does not support non-Airflow operators, and we are working on a fix.
> > >>>>>>> > > >
> > >>>>>>> > > > (2) Caching pickled DAGs in the DB to be used by the webserver
> > >>>>>>> > > > We have a proof-of-concept implementation and are working on an AIP now.
> > >>>>>>> > > >
> > >>>>>>> > > > (3) Using JSON instead of Pickle in (1) and (2)
> > >>>>>>> > > > We decided to use JSON because Pickle is neither secure nor human-readable. The serialization approach is very similar to (1).
> > >>>>>>> > > >
> > >>>>>>> > > > I will update the PR (https://github.com/apache/airflow/pull/5594) to replace Pickle with JSON, and send our design for (2) as an AIP next week. Glad to check together whether our implementation makes sense and to make improvements on it.
> > >>>>>>> > > >
> > >>>>>>> > > > Thanks!
> > >>>>>>> > > > Zhou
> > >>>>>>> > > >
> > >>>>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik <kaxiln...@gmail.com> wrote:
> > >>>>>>> > > >
> > >>>>>>> > > >> Hi all,
> > >>>>>>> > > >>
> > >>>>>>> > > >> We, at Astronomer, are going to spend time working on DAG Serialisation. There are 2 AIPs that are somewhat related to what we plan to work on:
> > >>>>>>> > > >>
> > >>>>>>> > > >> - AIP-18 Persist all information from DAG file in DB <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
> > >>>>>>> > > >> - AIP-19 Making the webserver stateless <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
> > >>>>>>> > > >>
> > >>>>>>> > > >> We plan to use JSON as the Serialisation format and store it as a blob in the metadata DB.
> > >>>>>>> > > >>
> > >>>>>>> > > >> *Goals:*
> > >>>>>>> > > >>
> > >>>>>>> > > >> - Make the Webserver Stateless
> > >>>>>>> > > >> - Use the same version of the DAG across Webserver & Scheduler
> > >>>>>>> > > >> - Keep backward compatibility and have a flag (globally & at DAG level) to turn this feature on/off
> > >>>>>>> > > >> - Enable DAG Versioning (extended goal)
> > >>>>>>> > > >>
> > >>>>>>> > > >> We will be preparing a proposal (AIP) after some research and some initial work, and will open it for suggestions from the community.
> > >>>>>>> > > >>
> > >>>>>>> > > >> We already had some good brainstorming sessions with the Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko & Bas) & Alex (from Uber), which will be a good starting point for us.
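(To make the JSON-as-a-blob idea above concrete, a minimal hypothetical sketch of what serializing a DAG could look like - the field selection here is illustrative only, not the format the AIP will settle on:)

    import json

    def dag_to_json(dag):
        """Serialize an illustrative subset of an airflow.models.DAG to a JSON blob.

        The returned string would be stored in a metadata-DB table and read
        back by the webserver instead of re-parsing the Python DAG file.
        """
        return json.dumps({
            "dag_id": dag.dag_id,
            "schedule_interval": str(dag.schedule_interval),
            "tasks": [
                {
                    "task_id": task.task_id,
                    "task_type": type(task).__name__,
                    "upstream_task_ids": sorted(task.upstream_task_ids),
                }
                for task in dag.tasks
            ],
        })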
> > >>>>>>> > > >> If anyone in the community is interested in this or has some experience with the same and wants to collaborate, please let me know and join the #dag-serialisation channel on Airflow Slack.
> > >>>>>>> > > >>
> > >>>>>>> > > >> Regards,
> > >>>>>>> > > >> Kaxil

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129