Thanks Jacob and Daniel for your insights! I created a draft of "Airflow
operators design guidelines" https://s.apache.org/airflow-operators

I've left some questions that I think should be addressed. Feel free to
answer, add yours, comment, suggest and edit. I think that once we have
some general idea what we as a community expect from a "good operator" we
can start to think about what is mergeable and what is not.

Tomek

On Fri, Jul 10, 2020 at 3:28 AM Daniel Standish wrote:

> We should be careful not to treat every line in the docs as "constitution"
> -- i.e. as a commandment.
>
> And in the docs, I think we would be better off if we more clearly
> distinguished (1) the description of what *is* from (2) opinion about
> what *should be*.
>
> *This line should be chopped*
>
> Case in point is the line that animates this thread: *"An operator
> represents a single, ideally idempotent, task."*  (from here
> <https://airflow.readthedocs.io/en/stable/howto/operator/index.html>)
>
> It's just a guess, but I suspect that this line was not meant (or voted on)
> as a binding rule for the airflow project, but merely meant to serve as a
> one-sentence answer to the question "what is an operator" on what is
> essentially a table of contents page.
>
> I think we should actually remove the line.
>
> Normative bits should be in a normative context, and this kind of content
> makes more sense in an "operator design patterns" page, or a "best
> practices" section, where the merits of different patterns and reasoning
> can be presented.  And to the extent it is meant as a guideline or a rule
> -- it's too vague to be useful.
>
> So I'd propose we chop it and just leave the second line:
>
> > See the Operators Concepts
> > <https://airflow.readthedocs.io/en/stable/concepts.html#concepts-operators>
> > documentation and the Operators API Reference
> > <https://airflow.readthedocs.io/en/stable/_api/index.html> for more
> > information.
>
>
> *Is idempotence ideal?*
>
> Incidentally, even in the context of a "best practices page" I'd argue
> against the claim that *"idempotence is ideal."*  First of all it needs
> clarification about what it actually means.  But suppose that we accept
> that it means the canonical execution date pattern.  Some pipelines and
> tasks lend themselves to this pattern; some do not.  And while it is a good
> pattern where it works, it's not the only valid design pattern, it isn't
> the best solution for every data problem, and therefore it doesn't make
> sense to refer to it as "the ideal pattern".
>
> The execution_date-based idempotence pattern has special importance to
> airflow but I think that in reality the average cluster will have a variety
> of design patterns -- not all of them using the execution_date idempotence
> pattern.  And I think we should reflect that reality in our docs and
> decision-making.
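
To make "the canonical execution date pattern" concrete, here is a minimal sketch. It assumes the pattern means that rerunning a task for the same execution date leaves the target in the same state. The `events` table, the helpers `load_append`, `load_partition` and `run_twice`, and the use of SQLite are illustrative assumptions only, not anything taken from Airflow or its docs.

```python
import sqlite3


def load_append(conn, execution_date, rows):
    """Non-idempotent load: every run appends, so a rerun duplicates rows."""
    conn.executemany(
        "INSERT INTO events (ds, value) VALUES (?, ?)",
        [(execution_date, value) for value in rows],
    )


def load_partition(conn, execution_date, rows):
    """Idempotent per execution date: replace that date's rows, then insert."""
    conn.execute("DELETE FROM events WHERE ds = ?", (execution_date,))
    conn.executemany(
        "INSERT INTO events (ds, value) VALUES (?, ?)",
        [(execution_date, value) for value in rows],
    )


def run_twice(load):
    """Run the same load twice for one date and return the final row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ds TEXT, value INTEGER)")
    for _ in range(2):  # the second pass simulates a cleared and rerun task
        load(conn, "2020-07-10", [1, 2, 3])
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Under these assumptions, `run_twice(load_append)` ends with twice as many rows as `run_twice(load_partition)`. The second form only works when the data can be partitioned by date, which is the sense in which some pipelines lend themselves to the pattern and some do not.
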
>
> *What is a single task?*
>
> Notwithstanding the above, regarding Tomek's question about the meaning of
> "single task", I think in effect what is meant here is just "*discrete* task"
> or "unit of work" -- a unit of work that can be picked up and executed on a
> worker.  I don't take it as a claim about *recommended* operator scope --
> if that's what it is meant to be, it should probably be made explicitly and
> in an appropriate context.
>
> *What an operator is, vs what is mergeable*
>
> On another note, I think it also may be helpful to separate the question
> "what is an operator" from "what kinds of operators belong in airflow".
> Indeed this is another area of ambiguity in the quote above -- is it a
> claim about a best practice for users, as they implement operators for
> their organization?  Or is it a claim about guidelines when considering
> whether to merge a new operator into airflow?
>
> From the perspective of "what is an operator", it is clear to me that an
> operator (1) is not necessarily idempotent, and (2) has arbitrary scope
> (i.e. re Tomek's "what is a single task" question).
>
>    - idempotence is in general undefined because it depends entirely on how
>    the user defines the task. (e.g. look at any SqlOperator)
>    - scope is clearly arbitrary because `execute` can be implemented
>    arbitrarily.
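
A small sketch of the first bullet, under stated assumptions: `SqlLikeOperator` is a hypothetical stand-in written for this example, not Airflow's actual SQL operator class, and the `counters` table and `value_after_rerun` helper are made up too. The point is only that the same operator class is idempotent or not depending on the SQL the user hands it.

```python
import sqlite3


class SqlLikeOperator:
    """Hypothetical stand-in for an SQL operator (not an Airflow class)."""

    def __init__(self, task_id, sql):
        self.task_id = task_id
        self.sql = sql  # user-supplied; the operator cannot judge what it does

    def execute(self, context):
        # Runs whatever statement the user passed, parameterised by the date.
        context["conn"].execute(self.sql, {"ds": context["ds"]})


def value_after_rerun(op):
    """Execute the operator twice for one date and return the stored value."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (ds TEXT, n INTEGER)")
    conn.execute("INSERT INTO counters VALUES ('2020-07-10', 0)")
    context = {"conn": conn, "ds": "2020-07-10"}
    op.execute(context)
    op.execute(context)  # simulate clearing and rerunning the task
    return conn.execute("SELECT n FROM counters").fetchone()[0]


# Same class, different user SQL: one rerun-safe, one not.
set_op = SqlLikeOperator("set", "UPDATE counters SET n = 1 WHERE ds = :ds")
incr_op = SqlLikeOperator("incr", "UPDATE counters SET n = n + 1 WHERE ds = :ds")
```

Rerunning `set_op` leaves the value unchanged, while rerunning `incr_op` changes it again, and nothing in the operator's design tells the two apart.
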
>
> Concerning "what kinds of operators belong in airflow"...  I think it's
> clear that idempotence is not a requirement (because it's not in general a
> thing that is determinable based on operator design alone, but depends on
> usage).   But, are there principles or guidelines that we should try to
> adhere to, or evaluate against?  There very well might be.  Or, should we
> try to maintain *compatibility* with a certain notion of idempotence, even
> if we don't have a well-defined idempotence criterion?  Maybe so.
>
