mniehoff commented on issue #15588: URL: https://github.com/apache/airflow/issues/15588#issuecomment-842281159
I dug a bit deeper into it, and the problem seems to be within the Databricks operator (tbh: I was expecting it to be the operator rather than the scheduler). In contrast to e.g. BigQuery, Databricks job runs do not have an id that you can set from the outside; it is generated as soon as a new run is triggered. The BigQueryJobOperator creates the job id itself, so if the operator is restarted due to a scheduler restart, it checks whether a job with this id is already running and, if so, "attaches" to that running job. For Databricks this is not possible, as the run id is not configurable.

A few options I see to mitigate this:

1) Save the run id somewhere (not sure where) that survives the scheduler restart and can be picked up by the operator, so the operator can reattach to the run. The run id would be deleted once the run has finished. A rough sketch of the reattach logic follows below.
2) Always reattach if there is an existing run for a given job. This would work in my case, but not in general: Databricks allows concurrent runs per job, and there will definitely be cases where a run exists and a new run should still be triggered.
3) Currently the operator polls. One could give the operator an `async=True` flag that exits the operator as soon as the Databricks run has been started, and then use a sensor to poll the run status (the run id is already available via XCom). See the sensor sketch at the end.

Imho only 1) and 3) are feasible, but I am not sure where to store the run id so that it survives a scheduler restart; 3) is imho the cleanest solution. Let me know what you think. I definitely aim to contribute these changes back to the provider packages.
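For option 1, the reattach logic itself is fairly mechanical once a store is chosen; the open question is only where the id lives. A minimal sketch, built on the existing `DatabricksHook` API (`submit_run` / `get_run_state`): the `_load_run_id` / `_save_run_id` / `_delete_run_id` helpers are hypothetical placeholders for whatever persistent store ends up being used.

```python
import time

from airflow.exceptions import AirflowException
from airflow.providers.databricks.hooks.databricks import DatabricksHook
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


class ReattachingDatabricksSubmitRunOperator(DatabricksSubmitRunOperator):
    """Sketch: reattach to a still-running Databricks run after a scheduler restart."""

    def _load_run_id(self, context):
        # Hypothetical: read the run id from whatever store survives a
        # scheduler restart (the open question above). Returns None if absent.
        raise NotImplementedError

    def _save_run_id(self, context, run_id):
        raise NotImplementedError  # hypothetical: persist the run id

    def _delete_run_id(self, context):
        raise NotImplementedError  # hypothetical: drop the id once finished

    def execute(self, context):
        hook = DatabricksHook(databricks_conn_id=self.databricks_conn_id)

        run_id = self._load_run_id(context)
        if run_id is not None and not hook.get_run_state(run_id).is_terminal:
            self.log.info("Reattaching to running Databricks run %s", run_id)
        else:
            run_id = hook.submit_run(self.json)  # start a fresh run
            self._save_run_id(context, run_id)

        # Same polling loop the operator uses today, just starting from a
        # possibly pre-existing run instead of always submitting a new one.
        while True:
            state = hook.get_run_state(run_id)
            if state.is_terminal:
                break
            time.sleep(self.polling_period_seconds)

        self._delete_run_id(context)
        if not state.is_successful:
            raise AirflowException(f"Run {run_id} failed: {state.state_message}")
```

Note that XCom is not a good fit for the store here, since a task instance's XCom is cleared when it is re-run, which is exactly why the storage question is still open.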

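And to make option 3 concrete, here is what such a sensor could look like, again reusing the existing `DatabricksHook`. The `DatabricksRunSensor` name and its parameters are hypothetical; nothing like this exists in the provider package yet.

```python
from airflow.exceptions import AirflowException
from airflow.providers.databricks.hooks.databricks import DatabricksHook
from airflow.sensors.base import BaseSensorOperator


class DatabricksRunSensor(BaseSensorOperator):
    """Sketch: pokes the Databricks Runs API until the run reaches a terminal state."""

    template_fields = ("run_id",)

    def __init__(self, *, run_id, databricks_conn_id="databricks_default", **kwargs):
        super().__init__(**kwargs)
        self.run_id = run_id
        self.databricks_conn_id = databricks_conn_id

    def poke(self, context):
        hook = DatabricksHook(databricks_conn_id=self.databricks_conn_id)
        state = hook.get_run_state(self.run_id)
        if not state.is_terminal:
            return False  # not finished yet; the sensor will poke again
        if not state.is_successful:
            raise AirflowException(
                f"Databricks run {self.run_id} terminated with failure: {state.state_message}"
            )
        return True
```

In a DAG this would sit right after the (now fire-and-forget) submit task, pulling the run id the operator already pushes to XCom, e.g. `DatabricksRunSensor(task_id="wait_for_run", run_id="{{ ti.xcom_pull(task_ids='submit_run', key='run_id') }}")`. Since sensors are designed to be interrupted and resumed, a scheduler restart during the wait is harmless.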