Hi Kaxil, Jarek, and Ash,

Here is the AIP:
https://cwiki.apache.org/confluence/display/AIRFLOW/Remove+double+dag+parsing+in+airflow+run

Looking forward to your feedback.
Thanks,
Ping

On Mon, Dec 20, 2021 at 10:46 AM Ping Zhang <[email protected]> wrote:

> Hi Kaxil,
>
> Thanks for the comment. The serialized_dag isn't used to run the task
> in the `airflow run --raw` process. It is used in `airflow run
> --local` to perform `check_and_change_state_before_execution`:
>
> https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L88-L99
>
> Thanks,
>
> Ping
>
>
> On Mon, Dec 20, 2021 at 4:51 AM Kaxil Naik <[email protected]> wrote:
>
>> Yup, forking only applies when os.fork is available and run_as_user
>> isn't specified. We only added enough detail to Serialized DAGs for
>> what the Webserver needs and for the Scheduler to make scheduling
>> decisions.
>>
>> So they do not contain all the information (all the args and kwargs,
>> including callables) required to run the task.
>>
>> Looking forward to the AIP.
>>
>> Regards,
>> Kaxil
>>
>> On Fri, Dec 17, 2021 at 11:04 PM Ping Zhang <[email protected]> wrote:
>>
>>> Hi Ash,
>>>
>>> Thanks for the input about the fork approach. I have checked the
>>> code: the fork only applies when there is no run_as_user, and I
>>> think run_as_user is an important feature.
>>>
>>> I will create an AIP with more details.
>>>
>>> Best wishes,
>>>
>>> Ping Zhang
>>>
>>>
>>> On Fri, Dec 17, 2021 at 9:59 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Yeah, I would also love to see some details in the meeting I
>>>> proposed :). I am particularly interested in the current
>>>> limitations of the solution in the "general" case.
>>>>
>>>> J.
>>>>
>>>> On Fri, Dec 17, 2021 at 11:16 AM Ash Berlin-Taylor <[email protected]>
>>>> wrote:
>>>> >
>>>> > On Thu, Dec 16 2021 at 16:19:45 -0800, Ping Zhang <[email protected]>
>>>> > wrote:
>>>> >
>>>> > To run airflow tasks, airflow needs to parse the dag file twice:
>>>> > once in the airflow run local process, and once in airflow run raw.
>>>> >
>>>> > This isn't true in most cases anymore, thanks to a change from
>>>> > spawning a new process (os.exec(["airflow", ...])) to forking
>>>> > instead.
>>>> >
>>>> > The serialized_dag table doesn't (currently) contain enough
>>>> > information to actually execute every dag, especially in the case
>>>> > of PythonOperator, so the actual dag file on disk needs to be
>>>> > loaded to get the code to run. Perhaps it would be possible to do
>>>> > this for some operators, but not all.
>>>> >
>>>> > Still, it might be worth looking at, and I'm looking forward to
>>>> > the proposal!
>>>> >
>>>> > -ash
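As a companion to the thread above, here is a minimal sketch of the
fork-versus-spawn decision that Kaxil and Ash describe. The names
`start_raw_task` and `run_raw_task` are hypothetical stand-ins for
illustration, not Airflow's actual API, and the `sudo` invocation is
simplified:

```python
import os
import subprocess


def run_raw_task(args):
    """Stand-in for the real task-execution entry point (hypothetical)."""
    print("executing task with", args)


def start_raw_task(args, run_as_user=None):
    """Sketch of the fork-vs-spawn decision discussed in this thread.

    Fork when os.fork is available and no run_as_user is set; otherwise
    spawn a fresh `airflow run --raw` process, which must re-parse the
    DAG file because it shares no memory with the parent.
    """
    if run_as_user is None and hasattr(os, "fork"):
        pid = os.fork()
        if pid == 0:
            # Child process: inherits the parent's already-parsed DAG
            # objects, so no second DAG parse is needed.
            run_raw_task(args)
            os._exit(0)
        return pid
    # Spawn path (e.g. when run_as_user is set): a brand-new interpreter
    # with no shared memory, so the DAG file is parsed a second time.
    cmd = ["airflow", "run", "--raw", *args]
    if run_as_user:
        # Airflow switches users via sudo; exact flags simplified here.
        cmd = ["sudo", "-u", run_as_user, *cmd]
    return subprocess.Popen(cmd).pid
```

In short, the fork path reuses the DAG objects already parsed in the
parent's memory, while the spawn path (required for run_as_user) starts
a fresh interpreter that has to parse the DAG file again, which is the
double parsing the AIP proposes to remove.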
