For sure! I'll try my best to squeeze out some time for it during the weekend and see how I can help facilitate the effort (don't get your hopes up too much though, I've got ~2 million things piled up in my TODO list :P). I'll bring this up to the team and see if they can help too.
You guys have shown pretty solid understanding and skills, so I believe you can handle it well without me, but just in case you need me, don't hesitate to shoot me a direct mail.

Cheers,
Kevin Y

On Mon, Jul 29, 2019 at 8:19 PM Zhou Fang <zhouf...@google.com> wrote:

> Hi Kevin, it makes sense. Thanks for the explanation! Hope we can get DB
> persistence moving faster.
>
> Zhou
>
> On Mon, Jul 29, 2019, 5:46 PM Kevin Yang <yrql...@gmail.com> wrote:
>
>> oops, s/consistent file/consistent file order/
>>
>> On Mon, Jul 29, 2019 at 5:42 PM Kevin Yang <yrql...@gmail.com> wrote:
>>
>>> Hi Zhou,
>>>
>>> Totally understood, thank you for that. The streaming logic does cover
>>> most cases, though we still have worst cases where os.walk doesn't give
>>> us a consistent file order and file/dir additions/renames cause a
>>> different result order from list_py_file_paths (e.g. right after we
>>> parsed the first dir, it was renamed and will be parsed last in the 2nd
>>> DAG loading round; or we merged a new file right after the file paths
>>> were collected). Maybe there's a way to guarantee the parsing order,
>>> but I'm not sure it's worth the effort given that it is less of a
>>> problem if the end-to-end parsing time is small enough. I understand it
>>> may have started as a short-term improvement, but since it should not
>>> be much more complicated, we'd rather start with the unified long-term
>>> pattern.
>>>
>>> Cheers,
>>> Kevin Y
>>>
>>> On Mon, Jul 29, 2019 at 3:59 PM Zhou Fang <zhouf...@google.com> wrote:
>>>
>>>> Hi Kevin,
>>>>
>>>> Yes, DAG persistence in the DB is definitely the way to go. I referred
>>>> to the async DAG loader because it may alleviate your current problem
>>>> (since the code is ready).
>>>>
>>>> It actually reduces the time to 15 min, because DAGs are refreshed by
>>>> the background process in a streaming way and you don't need to
>>>> restart the webserver every 20 min.
>>>>
>>>> Thanks,
>>>> Zhou
>>>>
>>>> On Mon, Jul 29, 2019 at 3:14 PM Kevin Yang <yrql...@gmail.com> wrote:
>>>>
>>>>> Hi Zhou,
>>>>>
>>>>> Thank you for the pointer. This solves the issue of the gunicorn
>>>>> restart rate throttling the webserver refresh rate, but not the long
>>>>> DAG parsing time issue, right? Worst case we still wait 30 mins for a
>>>>> change to show up, compared to the previous 35 mins (I was wrong on
>>>>> the number: it should be 35 mins instead of 55 mins, as the clock
>>>>> starts whenever the webserver restarts). I believe in the previous
>>>>> discussion we first proposed this local webserver DAG parsing
>>>>> optimization, reusing the scheduler's DAG parsing logic to speed up
>>>>> parsing. Then the stateless webserver proposal came up, and we bought
>>>>> into the idea that it is better to persist DAGs into the DB and read
>>>>> directly from the DB, for better DAG definition consistency and
>>>>> webserver cluster consistency. I'm fully supportive of the structure
>>>>> proposed in AIP-24, but -1 on just feeding the webserver from a
>>>>> single subprocess parsing the DAGs. I would imagine there won't be
>>>>> too much additional work to fetch from the DB instead of a
>>>>> subprocess, would there? (I haven't looked into the serialization
>>>>> format part, but I assume they are the same/similar.)
>>>>>
>>>>> Cheers,
>>>>> Kevin Y
>>>>>
>>>>> On Mon, Jul 29, 2019 at 2:18 PM Zhou Fang <zhouf...@google.com> wrote:
>>>>>
>>>>>> Hi Kevin,
>>>>>>
>>>>>> The problem of DAG parsing taking a long time can be solved by
>>>>>> asynchronous DAG loading:
>>>>>> https://github.com/apache/airflow/pull/5594
>>>>>>
>>>>>> The idea is that a background process parses DAG files and sends
>>>>>> DAGs to the webserver process every [webserver]
>>>>>> dagbag_sync_interval = 10s.
>>>>>>
>>>>>> We have launched it in Composer, so our users can set the webserver
>>>>>> worker restart interval to 1 hour (or longer).
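[A back-of-the-envelope model of the worst-case staleness numbers traded back and forth in this thread, assuming a 15-min total parse and a 20-min worker restart interval; the variable names are illustrative only.]

```python
# Illustrative model of the delays discussed in the thread, not Airflow code.

PARSE = 15    # minutes to parse all DAG files once
RESTART = 20  # webserver worker restart interval, minutes

# Old behavior: worst case, a change lands right after a restart began,
# so it waits out the full restart interval and then one full parse.
worst_old = RESTART + PARSE           # 35 min

# Async loader: worst case, a change lands just after its file was parsed
# in the current streaming pass, so it waits roughly the rest of that pass
# plus one full next pass.
worst_async = 2 * PARSE               # 30 min

# End-to-end gap between the scheduler scheduling a new DAG and users
# seeing it: scheduler's parse + webserver restart wait + webserver parse.
end_to_end = PARSE + RESTART + PARSE  # 50 min

print(worst_old, worst_async, end_to_end)
```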
>>>>>> The background DAG
>>>>>> parsing process refreshes all DAGs every [webserver]
>>>>>> collect_dags_interval = 30s.
>>>>>>
>>>>>> If parsing all DAGs takes 15 min, you can see DAGs being gradually
>>>>>> refreshed with this feature.
>>>>>>
>>>>>> Thanks,
>>>>>> Zhou
>>>>>>
>>>>>> On Sat, Jul 27, 2019 at 2:43 AM Kevin Yang <yrql...@gmail.com> wrote:
>>>>>>
>>>>>>> Nice job Zhou!
>>>>>>>
>>>>>>> Really excited; this is exactly what we wanted for the webserver
>>>>>>> scaling issue. I want to add another big driver that previously got
>>>>>>> Airbnb thinking about supporting this effort: it can not only bring
>>>>>>> consistency between webservers but also consistency between the
>>>>>>> webserver and the scheduler/workers. It may be less of a problem if
>>>>>>> the total DAG parsing time is small, but for us the total DAG
>>>>>>> parsing time is 15+ mins and we had to set the webserver (gunicorn
>>>>>>> subprocess) restart interval to 20 mins, which leads to a
>>>>>>> worst-case 15+20+15=50 min delay between the scheduler starting to
>>>>>>> schedule things and users seeing their deployed DAGs/changes...
>>>>>>>
>>>>>>> I'm not so sure about the scheduler performance improvement:
>>>>>>> currently we already feed the main scheduler process with
>>>>>>> SimpleDags through DagFileProcessorManager running in a subprocess;
>>>>>>> in the future we'd feed it with data from the DB, which is likely
>>>>>>> slower (though the difference should have negligible impact on
>>>>>>> scheduler performance). In fact, if we keep the existing behavior
>>>>>>> of scheduling only freshly parsed DAGs, we may need to deal with a
>>>>>>> consistency issue: the DAG processor and the scheduler race to
>>>>>>> update the flag indicating whether a DAG was newly parsed. No big
>>>>>>> deal, just some thoughts off the top of my head that will hopefully
>>>>>>> be helpful.
>>>>>>>
>>>>>>> And good idea on pre-rendering the template; I believe template
>>>>>>> rendering was the biggest concern in the previous discussion. We've
>>>>>>> also chosen the pre-rendering+JSON approach in our smart sensor AIP
>>>>>>> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-17+Airflow+sensor+optimization>
>>>>>>> and it seems to be working fine--a supporting case for your
>>>>>>> proposal ;) There's a WIP PR
>>>>>>> <https://github.com/apache/airflow/pull/5499> for it just in case
>>>>>>> you are interested--maybe we can even share some logic.
>>>>>>>
>>>>>>> Thumbs-up again for this, and please don't hesitate to reach out if
>>>>>>> you want to discuss further with us or need any help from us.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Kevin Y
>>>>>>>
>>>>>>> On Sat, Jul 27, 2019 at 12:54 AM Driesprong, Fokko
>>>>>>> <fo...@driesprong.frl> wrote:
>>>>>>>
>>>>>>> > Looks great Zhou,
>>>>>>> >
>>>>>>> > One thing pops into my mind while reading the AIP: whether we
>>>>>>> > should keep the caching at the webserver level. As the famous
>>>>>>> > quote goes: *"There are only two hard things in Computer Science:
>>>>>>> > cache invalidation and naming things." -- Phil Karlton*
>>>>>>> >
>>>>>>> > Right now, the fundamental change proposed in the AIP is fetching
>>>>>>> > the DAGs from the database in a serialized format instead of
>>>>>>> > parsing the Python files all the time. This will already give a
>>>>>>> > great performance improvement on the webserver side because it
>>>>>>> > removes a lot of the processing. However, since we're still
>>>>>>> > fetching the DAGs from the database at a regular interval and
>>>>>>> > caching them in the local process, we still have the two issues
>>>>>>> > that Airflow is suffering from right now:
>>>>>>> >
>>>>>>> > 1. No snappy UI, because it is still polling the database at a
>>>>>>> > regular interval.
>>>>>>> > 2. Inconsistency between webservers, because they might poll at
>>>>>>> > different intervals. I think we've all seen this:
>>>>>>> > https://www.youtube.com/watch?v=sNrBruPS3r4
>>>>>>> >
>>>>>>> > As I also mentioned in the Slack channel, I strongly feel that we
>>>>>>> > should be able to render most views from the tables in the
>>>>>>> > database, without touching the blob. For specific views, we could
>>>>>>> > just pull the blob from the database. In this case we always have
>>>>>>> > the latest version, and we tackle the second point above.
>>>>>>> >
>>>>>>> > To tackle the first one, I also have an idea. We should change the
>>>>>>> > DAG parser from a loop to something that uses inotify:
>>>>>>> > https://pypi.org/project/inotify_simple/. This will change it from
>>>>>>> > polling to an event-driven design, which is much more performant
>>>>>>> > and less resource-hungry. But this would be an AIP on its own.
>>>>>>> >
>>>>>>> > Again, great design and a comprehensive AIP, but I would include
>>>>>>> > the caching on the webserver to greatly improve the user
>>>>>>> > experience in the UI. Looking forward to the opinions of others
>>>>>>> > on this.
>>>>>>> >
>>>>>>> > Cheers, Fokko
>>>>>>> >
>>>>>>> > On Sat, Jul 27, 2019 at 01:44, Zhou Fang
>>>>>>> > <zhouf...@google.com.invalid> wrote:
>>>>>>> >
>>>>>>> > > Hi Kaxil,
>>>>>>> > >
>>>>>>> > > Just sent out the AIP:
>>>>>>> > >
>>>>>>> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
>>>>>>> > >
>>>>>>> > > Thanks!
>>>>>>> > > Zhou
>>>>>>> > >
>>>>>>> > > On Fri, Jul 26, 2019 at 1:33 PM Zhou Fang <zhouf...@google.com>
>>>>>>> > > wrote:
>>>>>>> > >
>>>>>>> > > > Hi Kaxil,
>>>>>>> > > >
>>>>>>> > > > We are also working on persisting DAGs into the DB using JSON
>>>>>>> > > > for the Airflow webserver in Google Composer. We are aiming
>>>>>>> > > > to minimize the change to the current Airflow code. Happy to
>>>>>>> > > > get synced on this!
>>>>>>> > > >
>>>>>>> > > > Here is our progress:
>>>>>>> > > >
>>>>>>> > > > (1) Serializing DAGs using Pickle to be used in the
>>>>>>> > > > webserver. It has been launched in Composer, and I am working
>>>>>>> > > > on the PR to upstream it:
>>>>>>> > > > https://github.com/apache/airflow/pull/5594
>>>>>>> > > > Currently it does not support non-Airflow operators, and we
>>>>>>> > > > are working on a fix.
>>>>>>> > > >
>>>>>>> > > > (2) Caching pickled DAGs in the DB to be used by the
>>>>>>> > > > webserver. We have a proof-of-concept implementation and are
>>>>>>> > > > working on an AIP now.
>>>>>>> > > >
>>>>>>> > > > (3) Using JSON instead of Pickle in (1) and (2). We decided
>>>>>>> > > > to use JSON because Pickle is neither secure nor
>>>>>>> > > > human-readable. The serialization approach is very similar
>>>>>>> > > > to (1).
>>>>>>> > > >
>>>>>>> > > > I will update the PR
>>>>>>> > > > (https://github.com/apache/airflow/pull/5594) to replace
>>>>>>> > > > Pickle with JSON, and send our design for (2) as an AIP next
>>>>>>> > > > week. Glad to check together whether our implementation makes
>>>>>>> > > > sense and make improvements on it.
>>>>>>> > > >
>>>>>>> > > > Thanks!
>>>>>>> > > > Zhou
>>>>>>> > > >
>>>>>>> > > > On Fri, Jul 26, 2019 at 7:37 AM Kaxil Naik
>>>>>>> > > > <kaxiln...@gmail.com> wrote:
>>>>>>> > > >
>>>>>>> > > >> Hi all,
>>>>>>> > > >>
>>>>>>> > > >> We, at Astronomer, are going to spend time working on DAG
>>>>>>> > > >> Serialisation.
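[The JSON-over-Pickle choice in step (3) can be illustrated with a toy round-trip. This is not the actual schema of PR 5594 or AIP-24, just the general shape of the idea.]

```python
import json

def serialize_dag(dag_id, tasks):
    """Toy DAG -> JSON blob, in the spirit of AIP-24 (not its schema).

    Unlike pickle.dumps, the result is human-readable, diffable, and
    deserializing it cannot execute arbitrary code.
    """
    return json.dumps(
        {
            "dag_id": dag_id,
            # keep only declarative fields per task, never callables
            "tasks": [
                {"task_id": t["task_id"], "operator": t["operator"]}
                for t in tasks
            ],
        },
        sort_keys=True,
    )

def deserialize_dag(blob):
    """Parse the blob back into plain data for the webserver to render."""
    return json.loads(blob)
```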
>>>>>>> > > >> There are 2 AIPs that are somewhat related to what we plan
>>>>>>> > > >> to work on:
>>>>>>> > > >>
>>>>>>> > > >> - AIP-18 Persist all information from DAG file in DB
>>>>>>> > > >> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-18+Persist+all+information+from+DAG+file+in+DB>
>>>>>>> > > >> - AIP-19 Making the webserver stateless
>>>>>>> > > >> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-19+Making+the+webserver+stateless>
>>>>>>> > > >>
>>>>>>> > > >> We plan to use JSON as the serialisation format and store it
>>>>>>> > > >> as a blob in the metadata DB.
>>>>>>> > > >>
>>>>>>> > > >> *Goals:*
>>>>>>> > > >>
>>>>>>> > > >> - Make the webserver stateless
>>>>>>> > > >> - Use the same version of the DAG across webserver &
>>>>>>> > > >> scheduler
>>>>>>> > > >> - Keep backward compatibility and have a flag (globally & at
>>>>>>> > > >> the DAG level) to turn this feature on/off
>>>>>>> > > >> - Enable DAG versioning (extended goal)
>>>>>>> > > >>
>>>>>>> > > >> We will be preparing a proposal (AIP) after some research
>>>>>>> > > >> and some initial work, and will open it up for suggestions
>>>>>>> > > >> from the community.
>>>>>>> > > >>
>>>>>>> > > >> We have already had some good brainstorming sessions with
>>>>>>> > > >> Twitter folks (DanD & Sumit), folks from GoDataDriven (Fokko
>>>>>>> > > >> & Bas), and Alex (from Uber), which will be a good starting
>>>>>>> > > >> point for us.
>>>>>>> > > >>
>>>>>>> > > >> If anyone in the community is interested or has some
>>>>>>> > > >> experience with the same and wants to collaborate, please
>>>>>>> > > >> let me know and join the #dag-serialisation channel on
>>>>>>> > > >> Airflow Slack.
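[Kaxil's backward-compatibility goal — a global flag plus a per-DAG override — might resolve along these lines. All names here are hypothetical, pending the actual AIP.]

```python
def serialization_enabled(global_flag, dag_level_flag=None):
    """Resolve the backward-compatibility switch (hypothetical names).

    `dag_level_flag` is a per-DAG override (True/False/None); when a DAG
    says nothing, the global config setting wins. This mirrors the usual
    two-level override pattern for gradual feature rollout.
    """
    if dag_level_flag is not None:
        return dag_level_flag
    return global_flag
```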
>>>>>>> > > >>
>>>>>>> > > >> Regards,
>>>>>>> > > >> Kaxil