eric-workweek opened a new issue, #67616:
URL: https://github.com/apache/airflow/issues/67616
### Under which category would you file this issue?
Providers
### Apache Airflow version
3.2.1
### What happened and how to reproduce it?
PR #66449 changed the `end_time` calculation in
`DbtCloudRunJobOperator.execute()` from `time.time()` to `time.monotonic()`.
This value is serialized into the `DbtCloudRunJobTrigger`, stored in the
database, and deserialized in the triggerer process.
`time.monotonic()` returns seconds relative to an arbitrary per-system
reference point (typically boot time). When the worker and triggerer run in
different containers/pods/nodes (standard in Kubernetes/Astronomer
deployments), their monotonic clocks have different epochs. The trigger's
`end_time` comparison (`self.end_time <= time.monotonic()`) becomes meaningless
— the triggerer's `time.monotonic()` can already exceed the worker-computed
`end_time`, causing an **immediate timeout** regardless of the configured
`timeout` value (default 604800s).
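To illustrate the failure mode, here is a minimal sketch using made-up clock readings. The uptime values and the `has_timed_out` helper are hypothetical and are not provider code; the helper only mirrors the trigger's comparison. It shows that a deadline computed on a recently booted host can already be in the past when read on a host with a longer uptime:

```python
def has_timed_out(end_time: float, now: float) -> bool:
    """Mirror of the trigger's check: self.end_time <= time.monotonic()."""
    return end_time <= now


TIMEOUT = 604800  # default timeout from this report, in seconds

# Hypothetical monotonic readings: seconds since each host's own reference point.
worker_monotonic = 3_600.0          # worker's host booted about an hour ago
triggerer_monotonic = 2_592_000.0   # triggerer's host has been up about 30 days

# Computed on the worker, then serialized and read back on the triggerer.
end_time = worker_monotonic + TIMEOUT

timed_out_on_triggerer = has_timed_out(end_time, triggerer_monotonic)
timed_out_on_worker = has_timed_out(end_time, worker_monotonic + 1.0)

print(timed_out_on_triggerer, timed_out_on_worker)
```

With these assumed readings the triggerer sees the deadline as expired at once, while the same check on the worker's own clock does not.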
**Steps to reproduce:**
1. Deploy Airflow 3.x with `apache-airflow-providers-dbt-cloud==4.9.0` in a
multi-pod environment (worker and triggerer in separate pods)
2. Use `DbtCloudRunJobOperator` with `deferrable=True` and
`wait_for_termination=True`
3. Trigger a dbt Cloud job that takes more than a few seconds
The task defers, the triggerer picks it up, and immediately emits:
```
TriggerEvent<{'status': 'error', 'message': 'Job run {run_id} has not
reached a terminal status within the configured timeout.', 'run_id': ...}>
```
Observed timeline from our logs:
- `19:03:55` — Operator checks job status (`STARTING`), computes `end_time =
time.monotonic() + 604800`, defers
- `19:05:14` — Triggerer deserializes trigger, starts `run()`
- `19:05:15` — Trigger immediately evaluates `self.end_time <=
time.monotonic()` as `True` and fires timeout
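For scale, the elapsed time between deferral and the timeout event can be computed from the log timestamps above. This is plain arithmetic on the reported values, not provider code:

```python
from datetime import datetime

# Timestamps taken from the timeline in this report.
deferred_at = datetime.strptime("19:03:55", "%H:%M:%S")
timeout_fired_at = datetime.strptime("19:05:15", "%H:%M:%S")

elapsed_seconds = (timeout_fired_at - deferred_at).total_seconds()
configured_timeout = 604800  # default timeout, in seconds

print(f"elapsed: {elapsed_seconds:.0f}s of a {configured_timeout}s timeout")
```

The timeout fired well under two minutes into a seven-day window.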
The dbt Cloud job completed successfully ~4 minutes later. Two retries
produced the same result, wasting two dbt Cloud job runs.
**Regression from:** PR #66449 (not present in 4.8.2, released before that
PR merged)
### What you think should happen instead?
The trigger should wait up to the configured timeout (default 7 days) for
the job to reach a terminal status. This worked correctly in 4.8.2, which used
`time.time()` (wall-clock epoch seconds) — a value that is consistent across
processes and nodes.
**Suggested fix:**
Revert `end_time` (and `execution_deadline`) back to `time.time()` in both
the operator (`execute()`) and trigger (`run()`). `time.monotonic()` must not
be used for values that cross process/serialization boundaries.
```python
# operators/dbt.py, in execute()
end_time = time.time() + self.timeout  # wall-clock deadline, not time.monotonic()

# triggers/dbt.py, in run()
if self.end_time <= time.time():  # not time.monotonic()
    ...  # existing timeout handling stays unchanged
```
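As a sanity check on the idea, the sketch below round-trips a wall-clock deadline through JSON, which stands in for trigger serialization here. The helper names and the payload shape are hypothetical, and it assumes the worker and triggerer wall clocks are roughly synchronized (for example via NTP):

```python
import json
import time


def make_deadline(timeout: float) -> float:
    """Worker side: absolute deadline in epoch seconds."""
    return time.time() + timeout


def serialize_trigger(end_time: float) -> str:
    """Stand-in for trigger serialization (hypothetical payload shape)."""
    return json.dumps({"end_time": end_time})


def is_expired(payload: str) -> bool:
    """Triggerer side: compare the stored deadline with the local wall clock."""
    end_time = json.loads(payload)["end_time"]
    return end_time <= time.time()


# 604800 is the default timeout mentioned in this report.
payload = serialize_trigger(make_deadline(604800))
print(is_expired(payload))
```

Because epoch seconds share one reference point across hosts, the deserialized deadline keeps its meaning on whichever node evaluates it.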
### Operating System
Astronomer Runtime (containerized)
### Deployment
Astronomer
### Apache Airflow Provider(s)
dbt-cloud
### Versions of Apache Airflow Providers
apache-airflow-providers-dbt-cloud==4.9.0
### Official Helm Chart version
Not Applicable
### Kubernetes Version
Not Applicable
### Helm Chart configuration
Not Applicable
### Docker Image customizations
Not Applicable
### Anything else?
**Workaround:** Pin to `apache-airflow-providers-dbt-cloud==4.8.2`
PR #67360 (included in 4.9.0) partially addressed the regression by
reverting `defer(timeout=...)` back to `None`, but did not fix the
`time.monotonic()` usage for `end_time` in the trigger's `run()` loop.
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)