[ https://issues.apache.org/jira/browse/AIRFLOW-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712555#comment-16712555 ]
Ash Berlin-Taylor commented on AIRFLOW-2229:
--------------------------------------------

One thing to point out: if the fileloc in the DB points to a file that does not define the DAG, then running tasks from that DAG will also likely fail. This is because when a task is run, the whole DagBag is not loaded; only the DAG(s) from the recorded fileloc are.

> Scheduler cannot retry abrupt task failures within factory-generated DAGs
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-2229
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2229
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 1.9.0
>            Reporter: James Meickle
>            Priority: Major
>
> We had an issue where one of our tasks failed without the worker updating state (unclear why, but let's assume it was an OOM), resulting in this series of error messages:
>
> {noformat}
> Mar 20 14:27:05 airflow-core-i-0fc1f995414837b8b.stg.int.dynoquant.com airflow_scheduler-stdout.log: [2018-03-20 14:27:04,993] {models.py:1595} ERROR - Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally?
> Mar 20 14:27:05 airflow-core-i-0fc1f995414837b8b.stg.int.dynoquant.com airflow_scheduler-stdout.log: NoneType
> Mar 20 14:27:05 airflow-core-i-0fc1f995414837b8b.stg.int.dynoquant.com airflow_scheduler-stdout.log: [2018-03-20 14:27:04,994] {jobs.py:1435} ERROR - Cannot load the dag bag to handle failure for <TaskInstance: nightly_dataload.dummy_operator 2018-03-19 00:00:00 [queued]>. Setting task to FAILED without callbacks or retries. Do you have enough resources?
> {noformat}
>
> Mysterious failures are not unexpected; we are in the cloud, after all. The concern is the last line: callbacks and retries are skipped, and the message implies a lack of resources. However, the machine was totally underutilized at the time.
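The fileloc mismatch at the heart of this report can be sketched without Airflow. A minimal pure-Python simulation; all file and DAG names below are hypothetical illustrations, not taken from the ticket:

```python
# Simulate how DagBag(fileloc) only finds DAG objects defined at module
# level in that one file, so a fileloc recorded for a factory-generated
# DAG can point at a file from which the DAG cannot be reloaded.

# Hypothetical mapping: file path -> DAG ids defined at module level.
FILES = {
    "dags/dag_factory.py": {"nightly_dataload"},  # factory instantiates the DAG
    "lib/dataload_code.py": set(),  # holds the DAG's code, defines no DAG object
}

def load_dagbag(fileloc):
    """Mimic models.DagBag(fileloc): parse only that one file for DAGs."""
    return FILES.get(fileloc, set())

# The DB records the fileloc as the file the code was imported from...
fileloc_in_db = "lib/dataload_code.py"

# ...so reloading the DAG by fileloc at failure-handling time finds
# nothing, and ti.handle_failure() is never reached.
assert "nightly_dataload" not in load_dagbag(fileloc_in_db)

# Reloading via the factory file would have succeeded:
assert "nightly_dataload" in load_dagbag("dags/dag_factory.py")
```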
> I dug into this code a bit more, and as far as I can tell this error is happening in this code path: [https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/jobs.py#L1427]
>
> {code}
> self.log.error(msg)
> try:
>     simple_dag = simple_dag_bag.get_dag(dag_id)
>     dagbag = models.DagBag(simple_dag.full_filepath)
>     dag = dagbag.get_dag(dag_id)
>     ti.task = dag.get_task(task_id)
>     ti.handle_failure(msg)
> except Exception:
>     self.log.error("Cannot load the dag bag to handle failure for %s"
>                    ". Setting task to FAILED without callbacks or "
>                    "retries. Do you have enough resources?", ti)
>     ti.state = State.FAILED
>     session.merge(ti)
>     session.commit()
> {code}
>
> I am not very familiar with this code, nor do I have time to attach a debugger at the moment, but I think what is happening here is:
> * I have a factory Python file, which imports and instantiates DAG code from other files.
> * The scheduler loads the DAGs from the factory file on the filesystem. It gets a fileloc (as represented in the DB) not of the factory file, but of the file it loaded code from.
> * The scheduler makes a simple DagBag from the instantiated DAGs.
> * This line of code uses the simple DAG, which references the original DAG object's fileloc, to create a new DagBag object.
> * This DagBag looks for the original DAG in the fileloc, which is the file containing that DAG's _code_, but is not actually importable by Airflow.
> * An exception is raised trying to load the DAG from the DagBag, which found nothing.
> * Handling of the task failure never occurs.
> * The over-broad {{except Exception}} swallows all of the above.
> * There's just a generic error message that is not helpful to a system operator.
>
> If this is the case, at minimum, the try/except should be rewritten to be more graceful and to have a better error message.
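One possible shape for the suggested rewrite: catch the narrow "DAG not found" case and log the offending fileloc, rather than swallowing every exception behind a resource warning. This is a sketch only; {{DagNotFound}}, {{handle_abrupt_failure}}, and the dict-based stand-ins for the DAG and TaskInstance are all hypothetical stubs, not Airflow's actual classes:

```python
# Sketch of a narrower failure-handling path. On DagNotFound, the log
# names the fileloc that failed to yield the DAG instead of asking
# "Do you have enough resources?". All names here are illustrative stubs.
import logging

log = logging.getLogger("scheduler")

class DagNotFound(Exception):
    """Raised when a DagBag built from a fileloc does not contain the DAG."""

def handle_abrupt_failure(get_dag, fileloc, dag_id, task_id, ti):
    """get_dag(fileloc, dag_id) returns a dag dict or raises DagNotFound."""
    try:
        dag = get_dag(fileloc, dag_id)
        ti["task"] = dag["tasks"][task_id]
        ti["state"] = "up_for_retry"  # failure callbacks/retries would run here
    except DagNotFound:
        log.error(
            "DAG %s not found in fileloc %s while handling the failure of "
            "task %s. The fileloc may point to a file that does not define "
            "the DAG (common with factory-generated DAGs). Setting task to "
            "FAILED without callbacks or retries.",
            dag_id, fileloc, task_id,
        )
        ti["state"] = "failed"
    return ti
```

With a handler shaped like this, the operator would see which fileloc failed to produce the DAG instead of a misleading hint about resources.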
> But I question whether this level of DagBag abstraction/indirection isn't making this failure case worse than it needs to be; under normal conditions the scheduler is definitely able to find the relevant factory-generated DAGs and execute tasks within them as expected, even with the fileloc set to the code path and not the import path.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)