bujji8411 commented on PR #60385: URL: https://github.com/apache/airflow/pull/60385#issuecomment-3788026431
Concrete use case and why `depends_on_past` is insufficient

Thanks for the feedback. Let me clarify the concrete use case and why `depends_on_past` does not fit here.

### Pipeline scenario

This is a stateful, file-based ingestion workflow:

`extract → validate_count → stp → archive → cleanup`

Key characteristics:

1. The same source file name is used on every run (the upstream system creates or overwrites it).
2. The file must not be re-processed until the previous run has fully completed.
3. Partial success leaves the system in an inconsistent external state.

### Failure scenario

**dag_run_1**

```
extract → success (file created)
validate_count → success
stp → failed
archive, cleanup → not executed
```

External state after failure

**_Key points:_**

1. The extracted file still exists with the same name.
2. Data may be partially processed.
3. The file is neither archived nor cleaned up.

### Why `depends_on_past` does not work

If we set:

**stp.depends_on_past = True**

Then in **dag_run_2:**

> Only `stp` is blocked.
> `extract` and `validate_count` will still run.
> This re-extracts or overwrites the same file and breaks idempotency.

What we actually need is:

> Start `extract` in dag_run_2 only if
> `extract`, `validate_count`, `stp`, `archive`, and `cleanup`
> in dag_run_1 all succeeded.

This is a group-level previous-run dependency, not a single-task dependency.

**Summary**

1. `depends_on_past` handles task-local idempotency.
2. It does not handle external state correctness.
3. The requested feature is a declarative alternative to custom sensors.
4. The behavior is opt-in and backward-compatible.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
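
The difference between the two behaviors described in the comment can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Airflow code and not the PR's implementation: the function names, the `PAST_OK` set, and the state dictionary are hypothetical, and the `upstream_failed` entries for `archive` and `cleanup` are an assumed way of recording "not executed". It also assumes a strictly linear pipeline in which every task that starts in the new run succeeds.

```python
from typing import Dict, List, Set

# Linear pipeline from the comment: each task runs after the one before it.
PIPELINE: List[str] = ["extract", "validate_count", "stp", "archive", "cleanup"]

# Previous-instance states that satisfy a per-task depends_on_past check.
PAST_OK: Set[str] = {"success", "skipped"}


def started_with_depends_on_past(prev: Dict[str, str], flagged: Set[str]) -> List[str]:
    """Tasks that start in the next run when only `flagged` tasks check their own past.

    `prev` maps task name -> state in the previous run (empty if there was none).
    """
    started: List[str] = []
    for task in PIPELINE:
        if task in flagged and task in prev and prev[task] not in PAST_OK:
            break  # this task waits, so everything downstream of it waits too
        started.append(task)
    return started


def started_with_group_gate(prev: Dict[str, str], group: List[str]) -> List[str]:
    """Tasks that start in the next run when the whole run is gated on the group.

    The gate opens only if every task in `group` succeeded in the previous run,
    or if there is no previous run at all.
    """
    if not prev or all(prev.get(task) == "success" for task in group):
        return list(PIPELINE)
    return []


# State of dag_run_1 from the failure scenario (assumed encoding of "not executed").
dag_run_1 = {
    "extract": "success",
    "validate_count": "success",
    "stp": "failed",
    "archive": "upstream_failed",
    "cleanup": "upstream_failed",
}

print(started_with_depends_on_past(dag_run_1, {"stp"}))  # ['extract', 'validate_count']
print(started_with_group_gate(dag_run_1, PIPELINE))      # []
```

In this model the per-task check lets `extract` and `validate_count` run again in dag_run_2 and overwrite the file, while the group-level gate keeps the whole run from starting until dag_run_1 has fully succeeded.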
