coufon commented on a change in pull request #5594: [AIRFLOW-4924] Loading DAGs asynchronously in Airflow webserver
URL: https://github.com/apache/airflow/pull/5594#discussion_r303715835
##########
File path: airflow/dag/stringified_dags.py
##########
@@ -0,0 +1,137 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# DagCached is a new feature in Airflow that caches processed DAGs in Airflow database.
+# DAGs are stringified first and seriailized by Pickle to be stored in database.
+# Stringified DAGs holds metadata of original DAGs and tasks, and can be used by
+# Airflow webserver and scheduler.
+
+"""Methods to stringify DAGs and tasks to be compatible with pickle."""
+
+import copy
+import functools
+import inspect
+import logging
+
+from airflow import models
+
+
+# Stringify all fields of DAGs and tasks except for time related fields.
+_dag_fields_to_keep = set([
+    'schedule_interval', 'start_date', 'end_date', 'dagrun_timeout',
+    'timezone', 'last_loaded', '_schedule_interval', 'test_field'])
+
+_task_fields_to_keep = set([
+    'retry_delay', 'max_retry_delay', 'start_date', 'end_date',
+    'schedule_interval', 'sla', 'execution_timeout'])
+
+_primitive_types = (int, bool, float, str, bytes)
+
+
+def _is_primitive(x):
+    return x is None or isinstance(x, _primitive_types)
+
+
+def _stringify_dag_or_task(x, stringified_dags, is_dag):
+    """Returns a stringified DAG or task."""
+    if is_dag and x.dag_id in stringified_dags:
+        return stringified_dags[x.dag_id]
+
+    # Cast any operators defined in non-airlfow modules to BaseOperator to ensure
+    # unpickle is successful. The downside is that the task will be displayed as
+    # BaseOperator in UI.
+    if not is_dag and not x.__class__.__module__.startswith('airflow.operators'):

Review comment:
> Recently, Jarek Potiuk fixed a bug related to multithreaded in this project
> #5200
> #5199

Sorry for the confusion. We would like Composer and Airflow to share the same implementation. Apart from code reorganization to pass pylint, the only difference is that in Composer the stringify method modifies a DAG or a task in place, whereas here it returns a deep copy. The benefit of returning a deep copy is that if a task uses a customer-defined operator (which cannot be pickled, and so cannot be sent through a multiprocessing queue), we can replace it with models.BaseOperator. The Composer implementation ignores any DAGs that cannot be pickled.

Yes, I agree that the multiprocessing here is non-trivial. We can add more tests to the unit test tests/www/async_dag_loaders.py. I will check Jarek's debugging to see whether we have similar issues.

Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
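
As a rough illustration of the copy-and-downgrade idea described in the review comment above, here is a minimal sketch. It is not the code from the pull request: `BaseOperator` and `CustomOperator` are hypothetical stand-ins rather than Airflow classes, `stringify_task` and `is_picklable` are illustrative helpers, and a `threading.Lock` attribute is assumed as the reason the custom operator cannot be pickled. The sketch builds the copy field by field instead of calling `copy.deepcopy`, since a deep copy would run into the same unpicklable attribute.

```python
import pickle
import threading

_PRIMITIVES = (int, bool, float, str, bytes)


class BaseOperator:
    """Hypothetical stand-in for models.BaseOperator (not the Airflow class)."""

    def __init__(self, task_id):
        self.task_id = task_id


class CustomOperator(BaseOperator):
    """Hypothetical user-defined operator that holds unpicklable state."""

    def __init__(self, task_id):
        super().__init__(task_id)
        self.lock = threading.Lock()  # assumption: this is what pickle rejects


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def stringify_task(task):
    """Return a stringified copy of `task`; the original is left untouched."""
    # Create the copy as a plain BaseOperator without running __init__.
    clone = BaseOperator.__new__(BaseOperator)
    for name, value in vars(task).items():
        keep = value is None or isinstance(value, _PRIMITIVES)
        # Primitive fields are kept as-is; everything else becomes a string.
        clone.__dict__[name] = value if keep else str(value)
    return clone


task = CustomOperator("extract")
clone = stringify_task(task)

print(type(task).__name__, is_picklable(task))
print(type(clone).__name__, is_picklable(clone))
```

An in-place variant would instead overwrite the fields on `task` itself, which loses the original operator; returning a copy keeps the original usable by the process that loaded it while only the stringified copy is sent to other processes.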
