mniehoff opened a new issue #15588:
URL: https://github.com/apache/airflow/issues/15588
**Apache Airflow version**: 2.0.0
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
**Environment**:
- **Cloud provider or hardware configuration**: hosted version of Astronomer Cloud
- **OS** (e.g. from /etc/os-release):
- **Kernel** (e.g. `uname -a`):
- **Install tools**:
- **Others**:
**What happened**:
- Using the DatabricksRunNowOperator to start a job in Databricks
- The scheduler gets restarted (e.g. by deploying to Astronomer or because the
scheduler lost the connection to the database)
- The Databricks task gets marked as failed directly after the scheduler comes
back up. No exception is visible in either the task log or the scheduler log,
just `"ERROR - Marking run <DagRun ...., externally triggered: False> failed"`
- The Databricks task gets started again: "Starting attempt 2 of 1"
- The operator tries to start a new Databricks run. This fails because
concurrent runs are limited to 1 on the Databricks side. The second task try is
also marked as failed in Airflow.
- The Databricks job is still running, but no Airflow task is polling for its
status
**What you expected to happen**:
The existing task keeps running after the scheduler has been restarted and
continues to poll for the Databricks job status.
**How to reproduce it**:
See "What happened" above. A simple DAG with a DatabricksOperator and then
restarting the scheduler works for us to reproduce the issue.
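For reference, a minimal sketch of the kind of DAG involved (the dag id, job id,
schedule, and connection id below are placeholders, not our actual configuration;
it assumes the Databricks provider is installed and a `databricks_default`
connection exists, and that the referenced Databricks job is long-running with
max concurrent runs set to 1):
```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="databricks_scheduler_restart_repro",  # placeholder dag id
    start_date=datetime(2021, 4, 28),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # While this task is polling (it logs "Sleeping for 60 seconds." between
    # polls), restart the scheduler to trigger the behaviour described above.
    DatabricksRunNowOperator(
        task_id="import_raw_and_process",
        job_id=123,  # placeholder for an existing long-running Databricks job
        databricks_conn_id="databricks_default",
        polling_period_seconds=60,
    )
```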
**Anything else we need to know**:
The problem occurs deterministically whenever the scheduler is restarted while
a Databricks operator task is running.
Last output of the first task:
```
[2021-04-29 06:44:13,509] {databricks.py:90} INFO - import_raw_and_process in run state: {'life_cycle_state': 'RUNNING', 'result_state': None, 'state_message': 'In run'}
[2021-04-29 06:44:13,510] {databricks.py:91} INFO - View run status, Spark UI, and logs at https://xxxxxxxxxx.cloud.databricks.com#job/xx/run/xxxx
[2021-04-29 06:44:13,510] {databricks.py:92} INFO - Sleeping for 60 seconds.
```
Output of the second task:
<details><summary>log of the second task</summary>
```
[2021-04-29 06:44:37,071] {taskinstance.py:852} INFO - Dependencies all met for <TaskInstance:task_id 2021-04-28T04:05:00+00:00 [queued]>
[2021-04-29 06:44:37,094] {taskinstance.py:852} INFO - Dependencies all met for <TaskInstance:task_id 2021-04-28T04:05:00+00:00 [queued]>
[2021-04-29 06:44:37,094] {taskinstance.py:1043} INFO -
--------------------------------------------------------------------------------
[2021-04-29 06:44:37,094] {taskinstance.py:1044} INFO - Starting attempt 2 of 1
[2021-04-29 06:44:37,094] {taskinstance.py:1045} INFO -
--------------------------------------------------------------------------------
[2021-04-29 06:44:37,163] {taskinstance.py:1064} INFO - Executing <Task(DatabricksRunNowOperator): import_raw_and_process> on 2021-04-28T04:05:00+00:00
[2021-04-29 06:44:37,168] {standard_task_runner.py:52} INFO - Started process 214 to run task
[2021-04-29 06:44:37,173] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', 'xxx.prod.integration', 'import_raw_and_process', '2021-04-28T04:05:00+00:00', '--job-id', '21291', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/xxx/prod/integration.py', '--cfg-path', '/tmp/tmpyneez1sp', '--error-file', '/tmp/tmp6cuo7gkp']
[2021-04-29 06:44:37,175] {standard_task_runner.py:77} INFO - Job 21291: Subtask import_raw_and_process
[2021-04-29 06:44:37,272] {logging_mixin.py:103} INFO - Running <TaskInstance:task_id 2021-04-28T04:05:00+00:00 [running]> on host spherical-antenna-8596-scheduler-55f666944b-xfwqg
[2021-04-29 06:44:37,392] {taskinstance.py:1258} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=xxxx
AIRFLOW_CTX_DAG_OWNER=datascience
AIRFLOW_CTX_DAG_ID=xxx.prod.integration
AIRFLOW_CTX_TASK_ID=import_raw_and_process
AIRFLOW_CTX_EXECUTION_DATE=2021-04-28T04:05:00+00:00
AIRFLOW_CTX_DAG_RUN_ID=scheduled__2021-04-28T04:05:00+00:00
[2021-04-29 06:44:37,395] {base.py:74} INFO - Using connection to: id: databricks_default. Host: https://xxxx.cloud.databricks.com, Port: None, Schema: , Login: token, Password: None, extra: XXXXXXXX
[2021-04-29 06:44:37,396] {databricks.py:170} INFO - Using token auth.
[2021-04-29 06:44:37,964] {databricks.py:73} INFO - Run submitted with run_id: 106466
[2021-04-29 06:44:37,965] {databricks.py:170} INFO - Using token auth.
[2021-04-29 06:44:38,286] {databricks.py:78} INFO - View run status, Spark UI, and logs at https://xxxx.cloud.databricks.com#job/xx/run/xxxx
[2021-04-29 06:44:38,287] {databricks.py:170} INFO - Using token auth.
[2021-04-29 06:44:38,690] {taskinstance.py:1457} ERROR - Task failed with exception
[2021-04-29 06:44:38,693] {taskinstance.py:1507} INFO - Marking task as FAILED. dag_id=xxx.prod.integration, task_id=import_raw_and_process, execution_date=20210428T040500, start_date=20210429T064437, end_date=20210429T064438
[2021-04-29 06:44:38,711] {email.py:184} INFO - Email alerting: attempt 1
[2021-04-29 06:44:39,054] {email.py:196} INFO - Sent an alert email to ['xxxx']
[2021-04-29 06:44:41,250] {local_task_job.py:142} INFO - Task exited with return code 1
```
</details>