[
https://issues.apache.org/jira/browse/AIRFLOW-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466645#comment-16466645
]
sunil kumar commented on AIRFLOW-2229:
--------------------------------------
I'm encountering tasks failing with a similar error message. Could it be that
the executor is unable to get the right status for the task from the broker?
[2018-05-05 13:41:43,111] {jobs.py:1425} ERROR - Executor reports task
instance %s finished (%s) although the task says its %s. Was the task killed
externally?
[2018-05-05 13:41:46,364] {jobs.py:1435} ERROR - Cannot load the dag bag to
handle failure for <TaskInstance:
repoman_v_02.extract_tablename_acct____prg_id_300 2018-05-05 13:40:10
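For reference, a simplified standalone paraphrase of the check that produces
that message (the function shape and state strings below are illustrative, not
the actual scheduler code): the scheduler distrusts an executor "finished"
report whenever the metadata database still shows the task as queued, which is
exactly what happens if a worker dies, or the broker delivery is lost, before
the TaskInstance row gets updated.
{code}
# Illustrative paraphrase only, not Airflow source.
def reconcile(executor_state, db_state):
    """Return the scheduler's complaint if the two views disagree, else None."""
    if executor_state in ("success", "failed") and db_state == "queued":
        return ("Executor reports task instance finished (%s) although the "
                "task says its %s. Was the task killed externally?"
                % (executor_state, db_state))
    return None

# A worker that crashes after taking the message off the broker but before
# updating the TaskInstance row produces exactly this disagreement:
print(reconcile("failed", "queued"))
{code}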
> Scheduler cannot retry abrupt task failures within factory-generated DAGs
> -------------------------------------------------------------------------
>
> Key: AIRFLOW-2229
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2229
> Project: Apache Airflow
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 1.9.0
> Reporter: James Meickle
> Priority: Major
>
> We had an issue where one of our tasks failed without the worker updating
> state (unclear why, but let's assume it was an OOM), resulting in this series
> of error messages:
> {{Mar 20 14:27:05 airflow-core-i-0fc1f995414837b8b.stg.int.dynoquant.com
> airflow_scheduler-stdout.log: [2018-03-20 14:27:04,993] {models.py:1595}
> ERROR - Executor reports task instance %s finished (%s) although the task
> says its %s. Was the task killed externally?}}
> {{Mar 20 14:27:05 airflow-core-i-0fc1f995414837b8b.stg.int.dynoquant.com
> airflow_scheduler-stdout.log: NoneType}}
> {{Mar 20 14:27:05 airflow-core-i-0fc1f995414837b8b.stg.int.dynoquant.com
> airflow_scheduler-stdout.log: [2018-03-20 14:27:04,994] {jobs.py:1435} ERROR
> - Cannot load the dag bag to handle failure for <TaskInstance:
> nightly_dataload.dummy_operator 2018-03-19 00:00:00 [queued]>. Setting task
> to FAILED without callbacks or retries. Do you have enough resources?}}
> Mysterious failures are not unexpected; we are in the cloud, after all. The
> concern is the last line: it skips callbacks and retries and suggests a lack
> of resources, yet the machine was almost completely idle at the time.
> I dug into this code a bit more and as far as I can tell this error is
> happening in this code path:
> [https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/jobs.py#L1427]
> {code}
> self.log.error(msg)
> try:
>     simple_dag = simple_dag_bag.get_dag(dag_id)
>     dagbag = models.DagBag(simple_dag.full_filepath)
>     dag = dagbag.get_dag(dag_id)
>     ti.task = dag.get_task(task_id)
>     ti.handle_failure(msg)
> except Exception:
>     self.log.error("Cannot load the dag bag to handle failure for %s"
>                    ". Setting task to FAILED without callbacks or "
>                    "retries. Do you have enough resources?", ti)
>     ti.state = State.FAILED
>     session.merge(ti)
>     session.commit()
> {code}
> I am not very familiar with this code, nor do I have time to attach a
> debugger at the moment, but I think what is happening here is:
> * I have a factory Python file, which imports and instantiates DAG code from
> other files.
> * The scheduler loads the DAGs from the factory file on the filesystem. It
> gets a fileloc (as represented in the DB) not of the factory file, but of the
> file it loaded code from.
> * The scheduler makes a simple DAGBag from the instantiated DAGs.
> * This line of code uses the simple DAG, which references the original DAG
> object's fileloc, to create a new DAGBag object.
> * This DagBag looks for the original DAG in the fileloc, which is the file
> containing that DAG's _code_, but which is not importable by Airflow on its
> own (see the sketch after this list).
> * An exception is raised when trying to load the DAG from that DagBag, which
> found nothing.
> * Handling of the task failure never occurs.
> * The over-broad except Exception clause swallows all of the above.
> * All that reaches the operator is a generic error message that does not
> point at the real cause.
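> To make the fileloc point concrete, here is a minimal sketch of the layout
> described above and of the lookup that then fails. The file names and dag_id
> are hypothetical; only the last three lines mirror the real code path.
> {code}
> # Hypothetical layout (illustrative names):
> #   dags/dag_factory.py          <- the file the scheduler actually parses
> #   shared/nightly_dataload.py   <- DAG(...) is called in here by build_dag(),
> #                                   so the DAG's fileloc records THIS path
> #
> # dags/dag_factory.py would contain something like:
> #   from shared.nightly_dataload import build_dag
> #   dag = build_dag()   # DAG exposed at module level, so scheduling works
> #
> # The failure path above then re-parses only the fileloc file:
> from airflow import models
>
> dag_id = "nightly_dataload"
> fileloc = "shared/nightly_dataload.py"   # dag.fileloc, not dags/dag_factory.py
>
> dagbag = models.DagBag(fileloc)   # this file never builds a DAG at import time
> dag = dagbag.get_dag(dag_id)      # so this returns None
> dag.get_task("dummy_operator")    # AttributeError, swallowed by `except Exception`
> {code}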
> If this is the case, then at minimum the try/except should be rewritten to
> fail more gracefully and produce a better error message. But I also question
> whether this level of DagBag abstraction/indirection is making this failure
> case worse than it needs to be: under normal conditions the scheduler has no
> trouble finding the relevant factory-generated DAGs and executing their
> tasks, even with the fileloc set to the code path and not the import path.
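> As a starting point, here is a minimal sketch of what "more graceful" could
> look like. It is not a tested patch: it reuses the names from the excerpt
> above, and the AirflowException import and message wording are assumptions.
> {code}
> from airflow.exceptions import AirflowException  # assumed available here
>
> self.log.error(msg)
> try:
>     simple_dag = simple_dag_bag.get_dag(dag_id)
>     dagbag = models.DagBag(simple_dag.full_filepath)
>     dag = dagbag.get_dag(dag_id)
>     if dag is None:
>         # Name the real problem instead of letting get_task() blow up below.
>         raise AirflowException(
>             "DagBag built from %s does not contain dag_id %s; is that file "
>             "importable on its own?" % (simple_dag.full_filepath, dag_id))
>     ti.task = dag.get_task(task_id)
>     ti.handle_failure(msg)
> except Exception:
>     # log.exception keeps the underlying traceback rather than hiding it
>     # behind a guess about resources.
>     self.log.exception(
>         "Could not reload DAG %s to run failure callbacks/retries for %s; "
>         "marking the task instance FAILED directly.", dag_id, ti)
>     ti.state = State.FAILED
>     session.merge(ti)
>     session.commit()
> {code}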
>