Very curious to follow this discussion! I think this has been debated even within the Airflow project, specifically what we should support for XCom with regard to idempotency. A while back there were some discussions on this, and a lack of consensus led to the revert in #6370 <https://github.com/apache/airflow/pull/6370>, which killed the idea in PR #6210 <https://github.com/apache/airflow/pull/6210>. Some interesting threads to review about idempotency and XCom are here <https://github.com/apache/airflow/pull/6210#discussion_r335593800> and here <https://github.com/apache/airflow/pull/6370#issuecomment-546579924>.

I'm by no means an expert on this, but I personally might suggest this working definition:

"By the end of its lifetime, an Airflow task should be authoritative on the target state of the state which that task modifies, regardless of previous task runs."

Or, stated less obtusely:

"The state of the universe that an Airflow task can modify should be deterministic by the end of a task run, irrespective of state changes due to previous task runs."

This captures the spirit of idempotency by focusing on how an Airflow task affects the state of the universe rather than on the implementation details of whether one task run affects the execution path of another task run.

This definition allows for the "create X resource if not exists; otherwise (re)-attach to the state of existing resource X" logic that you describe for Dataproc cluster creation / BQ job creation. The need for this sort of behavior in Airflow extends to being able to re-attach to any long-running task (Dataflow Job, Spark Job, Hive Query Job, etc.). This is sort of possible with a SubmitJobOperator and PollJobSensor, but it critically misses the ability to retry the job submit when a poke indicates a (retriable) failed state of the job (e.g. the job fails because inputs from some upstream dependency not managed by Airflow, such as a file drop from a 3rd-party vendor, don't exist yet).

A side note: I personally think that, in general, DELETING a resource that does not match the desired state of the operator is potentially dangerous and should always be a configurable behavior (kudos for doing this in the Dataproc PR!).

On Thu, Jul 9, 2020 at 7:25 AM Jarek Potiuk <[email protected]> wrote:

> All for it. I think misunderstandings and assumptions on what "idempotency"
> really means in the context of Airflow Tasks has bitten us more than once.
> I'd love to help with working out the right definition (and it's not
> straightforward). I will have to give it quite a bit thinking to get some
> of the corner cases and "guidelines" on them hashed out.
>
> On Tue, Jul 7, 2020 at 12:55 PM Tomasz Urbaszek <[email protected]>
> wrote:
>
> > Hello everyone,
> >
> > The plenty of integrations with external services a.k.a. operators is
> > one of the biggest advantages of Airflow. As documentation states:
> > "An operator represents a single, ideally idempotent, task."
> >
> > The idempotence - I think - is the key to create a usable operator. It
> > assures that we can run backfills and use fewer resources. The problem
> > is that there's no official Airflow definition of idempotence. Or at
> > least I'm not aware of any.
> >
> > What do I mean by "Airflow definition"? By this, I mean a guide or
> > recipe for making an operator idempotent including the limits of
> > real-world idempotency.
> >
> > The reason for bringing this topic are those two PRs:
> > - https://github.com/apache/airflow/pull/9593 which improves creating
> > Dataproc cluster (create, if exists check state, if wrong then delete
> > and wait and then create new one)
> > - https://github.com/apache/airflow/pull/9590 improving BigQuery insert
> > job idempotency (submit, if job_id exists check state, if running/ok
> > reattach, if failed then generate new job_id, submit)
> >
> > Both PRs implement suggestions from our users and solve real,
> > production-grade problems. Both do this in a non-perfect way because
> > each of those operators tries to tackle a variety of idempotence
> > problems. This requires some custom logic that has to work with
> > non-deterministic situations (i.e. Dataproc and unknown time of
> > deleting cluster). And that makes me wonder what is the exact
> > definition of "single, ideally idempotent, task"?
> >
> > Operators should answer users' needs - there's no question to that.
> > But it is the community that will have to maintain the operators. And
> > maintaining complex logic which is hard (or nearly impossible) to test
> > in e2e way is not a pleasant task.
> >
> > What I would like to ask you is:
> > - what does it mean for you that the operator is idempotent?
> > - what does it mean "single task"? Does it mean a single event or
> > operation (set of events)?
> >
> > By doing this I would like to work on a set of how-to rules for
> > designing the logic of `execute` method. I would like to encourage you
> > to share your experiences with designing and working with complex
> > operators :)
> >
> > Hope you are good,
> > Tomek
> >
>
> --
>
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
>
> M: +48 660 796 129
>

--
*Jacob Ferriero*
Strategic Cloud Engineer: Data Engineering
[email protected]
617-714-2509
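
For readers of this thread, here is a minimal Python sketch of the "submit; if job_id exists check state; if running/ok reattach; if failed then generate new job_id and submit" flow discussed above. It is only an illustration under stated assumptions: `FakeJobService`, `submit_or_reattach`, the state names, and the attempt-suffix scheme for job ids are all hypothetical and are not taken from Airflow, the BigQuery client, or either PR.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeJobService:
    """Hypothetical in-memory stand-in for an external job API."""

    jobs: Dict[str, str] = field(default_factory=dict)  # job_id -> state
    submit_count: int = 0

    def get_state(self, job_id: str) -> Optional[str]:
        # Returns None when the job id has never been submitted.
        return self.jobs.get(job_id)

    def submit(self, job_id: str) -> None:
        self.submit_count += 1
        self.jobs[job_id] = "RUNNING"


def submit_or_reattach(
    service: FakeJobService, base_job_id: str, max_attempts: int = 3
) -> str:
    """Return the id of a RUNNING or DONE job, submitting only when needed.

    Job ids are derived deterministically from base_job_id plus an attempt
    number (an assumption of this sketch), so a re-run of the same task
    finds the job created by a previous run instead of duplicating it.
    """
    for attempt in range(max_attempts):
        job_id = f"{base_job_id}_{attempt}"
        state = service.get_state(job_id)
        if state is None:
            # No such job yet: create it.
            service.submit(job_id)
            return job_id
        if state in ("RUNNING", "DONE"):
            # Job exists and is healthy: re-attach instead of resubmitting.
            return job_id
        # Job exists but FAILED: move on to the next deterministic job id.
    raise RuntimeError(f"{base_job_id}: all {max_attempts} attempts failed")


service = FakeJobService()
first = submit_or_reattach(service, "load_2020_07_07")   # submits a new job
again = submit_or_reattach(service, "load_2020_07_07")   # re-attaches
service.jobs[first] = "FAILED"                           # simulate a failure
retry = submit_or_reattach(service, "load_2020_07_07")   # submits a new id
```

Whatever the previous runs did, each call ends with exactly one live job for the given base id, which is the "authoritative on target state" property from the proposed definition. Deleting or replacing a resource in the wrong state is deliberately left out of this sketch.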
