Re: [PR] doc: add more instructions for `up_for_retry` [airflow]

via GitHub Wed, 13 Mar 2024 01:40:33 -0700


potiuk commented on code in PR #38100:
URL: https://github.com/apache/airflow/pull/38100#discussion_r1522762382



##########
docs/apache-airflow/authoring-and-scheduling/deferring.rst:
##########
@@ -257,3 +257,24 @@ In Airflow, sensors wait for specific conditions to be met 
before proceeding wit
 | Built-in functionality for rescheduling                |  Requires custom 
logic to defer task and handle        |
 |                                                        |  external changes   
                                   |
 
+--------------------------------------------------------+--------------------------------------------------------+
+
+Difference between ``up_for_retry`` and ``deferred`` state
+-------------------------------------------------------------------
+
+In Airflow, operators that's in `up_for_retry` state will still take worker 
slots, becasue the process still runs and does ``sleep`` there. Deferral 
Operators 
+
+In Airflow when a operator is in the ``up_for_retry`` state, it essentially 
means the operator is waiting to be retried after a failure, but it does not 
release its resources. The process remains alive, keeping its memory, sockets, 
and other resources allocated, except for the CPU. The ``deferred`` state, 
utilized only by Deferrable Operators, offers a more sophisticated approach to 
handling wait conditions. Deferrable Operators serialize and store the task's 
state, freeing all resources. When a condition is met, the task is deserialized 
and resumes operation, optimizing resource use by not holding onto resources 
during wait periods.
+
++--------------------------------------------------------+--------------------------------------------------------+
+|           state='up_for_retry'                         |          
state='deferred'                              |
++========================================================+========================================================+
+| Keeps resources while waiting.                         |  Releases 
resources, pauses execution when idle,       |

Review Comment:
   Yes. @CongyueZhang, I think our earlier discussion could have been a little 
misleading (I realized I was talking about a different kind of retry than you 
were), so it's great that you proposed the documentation here. 
   
   There are several things that you can call "retry" in Airflow:
   
   1) Retrying inside the task itself, in a loop: the task simply retries a 
certain operation and `sleeps` between attempts, usually via tenacity or a 
similar mechanism. This is what I was referring to when I talked about 
retrying something that is in progress.
   
   2) A retry that results in the task "failing" while it still has retries 
left (`up_for_retry`)
   
   Here, as @dirrao explained, the task effectively fails, and the slot and 
other resources are freed. But in this case the state of the task in progress 
is not saved. Whatever the task had done so far is lost, and on retry the 
whole task has to restart from the beginning. Waiting can only be time-based: 
once the retry time passes, the task restarts and redoes the job from the 
beginning. So while the task is in the `up_for_retry` state, resources are 
indeed not used, but because we do not keep the state of the originally failed 
task, every retry has to redo what the previous attempt did. This MIGHT lead 
to increased resource usage, because each re-run effectively has to do the 
same "preparation" (whatever the preparation is).
   
   3) Deferring is a mechanism where you can defer the task, serialize its 
state to disk, and let the Triggerer do the conditional wait. This means that 
your task effectively remains in a `half-done` state; that state is preserved, 
so you can resume where you left off. When the condition is met (this might be 
time-based, or waiting for external job completion or another async-io 
compatible condition), the state of the task is restored from disk and the 
task resumes with the state it had before it deferred.
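   
   To illustrate why handing the wait to a single async process is cheap, here 
is a minimal sketch using plain `asyncio` rather than Airflow's actual 
Triggerer. The names `wait_for_condition` and `run_triggerer` are hypothetical, 
and a timer stands in for whatever condition a real trigger would await:

```python
import asyncio
import time


async def wait_for_condition(task_id: int, delay: float) -> int:
    # Stand-in for an async-compatible wait. Here it is just a timer; a real
    # condition could be polling an external job without blocking the loop.
    await asyncio.sleep(delay)
    return task_id


async def run_triggerer(n_tasks: int, delay: float) -> list[int]:
    # One event loop multiplexes all the waits concurrently, so none of the
    # waiting "tasks" needs its own process or worker slot.
    waits = (wait_for_condition(i, delay) for i in range(n_tasks))
    return await asyncio.gather(*waits)


start = time.monotonic()
fired = asyncio.run(run_triggerer(1000, 0.1))
elapsed = time.monotonic() - start
print(f"{len(fired)} waits completed in {elapsed:.2f}s")
```

   The 1000 waits overlap instead of running back to back, which is the 
property that lets one process wait on behalf of many deferred tasks.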
   
   So in your progression, roughly speaking:
   
   1) takes resources while waiting but initialization is done only once
   2) retrying causes excessive resource use for redoing the initialization 
part of the task (for tasks that need reinitialization), but waiting does not 
hold resources
   3) the state is preserved between deferrals and initialization is done only 
once, so it roughly combines the benefits of 1) and 2), with the extra 
overhead of a triggerer that does the waiting for potentially 1000s of such 
deferred tasks.
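   
   The trade-off can be made concrete with a toy cost model. This is not 
Airflow code: the `Cost` fields, the function names, and the abstract "ticks" 
are all assumptions made for illustration. It only counts how often the 
expensive setup runs and where the waiting time is spent:

```python
from dataclasses import dataclass


@dataclass
class Cost:
    initializations: int = 0       # times the expensive setup ran
    slot_wait_ticks: int = 0       # waiting done while holding a worker slot
    triggerer_wait_ticks: int = 0  # waiting handed off to a shared triggerer


def in_task_retry(attempts: int, wait: int) -> Cost:
    """1) Loop and sleep inside the running task."""
    cost = Cost(initializations=1)        # setup runs once
    for attempt in range(attempts):
        if attempt < attempts - 1:        # every failed attempt is followed by a wait
            cost.slot_wait_ticks += wait  # the process sleeps but keeps its slot
    return cost


def up_for_retry(attempts: int, wait: int) -> Cost:
    """2) Fail the task and let it be retried after a delay."""
    cost = Cost()
    for _ in range(attempts):
        cost.initializations += 1         # each attempt starts from scratch
        # The wait between attempts holds no slot, so nothing is counted here.
    return cost


def deferred(attempts: int, wait: int) -> Cost:
    """3) Defer: keep the state, hand the wait to the triggerer."""
    cost = Cost(initializations=1)        # state is restored, not rebuilt
    for attempt in range(attempts):
        if attempt < attempts - 1:
            cost.triggerer_wait_ticks += wait
    return cost


for strategy in (in_task_retry, up_for_retry, deferred):
    print(strategy.__name__, strategy(attempts=3, wait=10))
```

   With three attempts and a wait of 10 ticks, only 1) accumulates slot-held 
waiting, only 2) repeats the initialization, and 3) moves the waiting to the 
triggerer.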
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
