Hi Kaxil, Jarek and Ash,

Here is the AIP:
https://cwiki.apache.org/confluence/display/AIRFLOW/Remove+double+dag+parsing+in+airflow+run.
Looking forward to your feedback.

Thanks,

Ping


On Mon, Dec 20, 2021 at 10:46 AM Ping Zhang <[email protected]> wrote:

> Hi Kaxil,
>
> Thanks for the comment. The serialized_dag isn't used to run the task in
> the `airflow run --raw` process. It is used in the `airflow run --local`
> process to perform `check_and_change_state_before_execution`:
>
> https://github.com/apache/airflow/blob/main/airflow/jobs/local_task_job.py#L88-L99
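>
> For reference, a minimal sketch of that flow (my simplified reading, not
> the exact implementation; launch_raw_process is a hypothetical helper):
>
>     from airflow.models.serialized_dag import SerializedDagModel
>
>     def run_local(ti):
>         # `airflow run --local`: the serialized DAG carries enough task
>         # metadata to validate dependencies and flip the TI state.
>         dag = SerializedDagModel.get(ti.dag_id).dag
>         ti.task = dag.get_task(ti.task_id)
>         # This only checks and updates state; it never calls the
>         # operator's execute(), so user callables aren't needed here.
>         if ti.check_and_change_state_before_execution():
>             launch_raw_process(ti)  # hypothetical: starts `airflow run --raw`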
>
>
> Thanks,
>
> Ping
>
>
> On Mon, Dec 20, 2021 at 4:51 AM Kaxil Naik <[email protected]> wrote:
>
>> Yup, forking only applies when os.fork is available and run_as_user
>> isn't specified. We only added enough detail to Serialized DAGs to serve
>> the Webserver and to make scheduling decisions in the Scheduler.
>>
>> So it does not contain all the information (all the args and kwargs,
>> including callables) required to run the task.
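>>
>> As a quick illustration (a sketch only; the exact serialized fields vary
>> by Airflow version):
>>
>>     import datetime
>>     from airflow import DAG
>>     from airflow.operators.python import PythonOperator
>>     from airflow.serialization.serialized_objects import SerializedDAG
>>
>>     with DAG("demo", start_date=datetime.datetime(2021, 1, 1)) as dag:
>>         PythonOperator(task_id="hello", python_callable=lambda: print("hi"))
>>
>>     payload = SerializedDAG.to_dict(dag)  # what serialized_dag stores
>>     # Only metadata (task_id, operator class, scheduling attrs) survives;
>>     # the python_callable itself is dropped, so the rebuilt DAG cannot
>>     # run the user's function without parsing the original file:
>>     rebuilt = SerializedDAG.from_dict(payload)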
>>
>> Looking forward to the AIP.
>>
>> Regards,
>> Kaxil
>>
>> On Fri, Dec 17, 2021 at 11:04 PM Ping Zhang <[email protected]> wrote:
>>
>>> Hi Ash,
>>>
>>> Thanks for the input on the fork approach. I have checked the code:
>>> forking only applies when there is no run_as_user, and I think
>>> run_as_user is an important feature.
>>>
>>> I will create an AIP with more details.
>>>
>>> Best wishes
>>>
>>> Ping Zhang
>>>
>>>
>>> On Fri, Dec 17, 2021 at 9:59 AM Jarek Potiuk <[email protected]> wrote:
>>>
>>>> Yeah. I would also love to see some details in the meeting I proposed
>>>> :). I am particularly interested in the current limitations of the
>>>> solution in the "general" case.
>>>>
>>>> J,
>>>>
>>>> On Fri, Dec 17, 2021 at 11:16 AM Ash Berlin-Taylor <[email protected]>
>>>> wrote:
>>>> >
>>>> > On Thu, Dec 16 2021 at 16:19:45 -0800, Ping Zhang <[email protected]>
>>>> > wrote:
>>>> >
>>>> To run Airflow tasks, Airflow needs to parse the DAG file twice: once
>>>> in the `airflow run --local` process and once in `airflow run --raw`.
>>>> >
>>>> >
>>>> > This isn't true in most cases anymore, thanks to a change from
>>>> > spawning a new process (os.exec(["airflow", ...])) to forking instead.
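>>>> >
>>>> > Roughly (a sketch of the decision logic, not the exact code;
>>>> > run_in_child and spawn_subprocess are hypothetical helpers):
>>>> >
>>>> >     import os
>>>> >
>>>> >     def launch_raw_task(command, run_as_user=None):
>>>> >         if hasattr(os, "fork") and not run_as_user:
>>>> >             # Fork: the child inherits the parent's already-parsed
>>>> >             # state, so the DAG file isn't parsed a second time.
>>>> >             pid = os.fork()
>>>> >             if pid == 0:
>>>> >                 run_in_child(command)
>>>> >         else:
>>>> >             # run_as_user needs a fresh interpreter under another
>>>> >             # user (e.g. via sudo), which re-parses the DAG file.
>>>> >             spawn_subprocess(["sudo", "-u", run_as_user] + command)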
>>>> >
>>>> > The serialized_dag table doesn't (currently) contain enough
>>>> > information to actually execute every DAG, especially in the case of
>>>> > PythonOperator, so the actual DAG file on disk needs to be loaded to
>>>> > get the code to run. Perhaps it would be possible to do this for some
>>>> > operators, but not all.
>>>> >
>>>> > Still might be worth looking at, and I'm looking forward to the
>>>> > proposal!
>>>> >
>>>> > -ash
>>>>
>>>
