We should be careful not to treat every line in the docs as "constitution"
-- i.e. as a commandment.

And in the docs, I think we would be better off if we more clearly
distinguished (1) the description of what *is* from (2) opinion about
what *should be*.

*This line should be chopped*

Case in point is the line that animates this thread: *"An operator
represents a single, ideally idempotent, task."*  (from here
<https://airflow.readthedocs.io/en/stable/howto/operator/index.html>)

It's just a guess, but I suspect that this line was not meant (or voted on)
as a binding rule for the airflow project, but merely meant to serve as a
one-sentence answer to the question "what is an operator" on what is
essentially a table of contents page.

I think we should actually remove the line.

Normative bits should be in a normative context, and this kind of content
makes more sense in an "operator design patterns" page, or a "best
practices" section, where the merits of different patterns and reasoning
can be presented.  And to the extent it is meant as a guideline or a rule
-- it's too vague to be useful.

So I'd propose we chop it and just leave the second line:

> See the Operators Concepts
> <https://airflow.readthedocs.io/en/stable/concepts.html#concepts-operators>
> documentation and the Operators API Reference
> <https://airflow.readthedocs.io/en/stable/_api/index.html> for more
> information.


*Is idempotence ideal?*

Incidentally, even in the context of a "best practices page" I'd argue
against the claim that *"idempotence is ideal."*  First of all, the term
needs clarification about what it actually means.  But suppose that we accept
that it means the canonical execution_date pattern.  Some pipelines and
tasks lend themselves to this pattern; some do not.  And while it is a good
pattern where it works, it's not the only valid design pattern, it isn't
the best solution for every data problem, and therefore it doesn't make
sense to refer to it as "the ideal pattern".

The execution_date-based idempotence pattern has special importance to
airflow but I think that in reality the average cluster will have a variety
of design patterns -- not all of them using the execution_date idempotence
pattern.  And I think we should reflect that reality in our docs and
decision-making.
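
To make the distinction concrete, here is a minimal sketch in plain Python (no
airflow imports) of what I take the execution_date pattern to mean: a task that
overwrites the partition owned by its execution date can be re-run safely,
while a task that appends cannot.  The names `warehouse`, `load_partition` and
`append_rows` are hypothetical and only for illustration.

```python
from datetime import date

# Hypothetical in-memory "table": maps an execution date to that date's rows.
warehouse = {}

# Hypothetical append-only target with no notion of execution date.
append_log = []


def load_partition(execution_date, rows):
    """Idempotent: overwrite the partition owned by this execution date."""
    warehouse[execution_date] = list(rows)


def append_rows(rows):
    """Not idempotent: every run adds the same rows again."""
    append_log.extend(rows)


run_date = date(2020, 1, 1)
rows = ["a", "b"]

# Simulate an initial run followed by a retry (or backfill) of the same date.
for _ in range(2):
    load_partition(run_date, rows)
    append_rows(rows)

print(len(warehouse[run_date]))  # partition holds 2 rows after both runs
print(len(append_log))           # append target holds 4 rows after both runs
```

Both tasks are perfectly reasonable units of work; only the first one fits the
execution_date pattern.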

*What is a single task?*

Notwithstanding the above, regarding Tomek's question about the meaning of
"single task", I think in effect what is meant here is just "*discrete* task"
or "unit of work" -- a unit of work that can be picked up and executed on a
worker.  I don't take it as a claim about *recommended* operator scope --
if that's what it is meant to be, it should probably be made explicitly and
in an appropriate context.

*What an operator is, vs what is mergeable*

On another note, I think it also may be helpful to separate the question
"what is an operator" from "what kinds of operators belong in airflow".
Indeed this is another area of ambiguity in the quote above -- is it a
claim about a best practice for users, as they implement operators for
their organization?  Or is it a claim about guidelines when considering
whether to merge a new operator into airflow?

From the perspective of "what is an operator", it is clear to me that an
operator (1) is not necessarily idempotent, and (2) has arbitrary scope
(i.e. re Tomek's 'what is single task' question).

   - idempotence is in general undefined because it depends entirely on how
   the user defines the task. (e.g. look at any SqlOperator)
   - scope is clearly arbitrary because `execute` can be implemented
   arbitrarily.
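
As a rough illustration of the first bullet, here is a sketch using only the
standard library's `sqlite3`.  `MiniSqlOperator` is a hypothetical stand-in
that I made up for this example; it is not airflow's class and makes no claim
about airflow's API.  The point is only that the same operator class yields an
idempotent or a non-idempotent task depending on the SQL the user hands it.

```python
import sqlite3


class MiniSqlOperator:
    """Hypothetical stand-in for a SQL operator: it runs whatever SQL it is given."""

    def __init__(self, sql):
        self.sql = sql

    def execute(self, conn):
        # The operator has no opinion about what the SQL does.
        conn.executescript(self.sql)
        conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, value INTEGER)")

# Same class, two usages: a plain append, and a delete-then-insert for one day.
append_task = MiniSqlOperator("INSERT INTO events VALUES ('2020-01-01', 1);")
replace_task = MiniSqlOperator(
    "DELETE FROM events WHERE day = '2020-01-02';"
    "INSERT INTO events VALUES ('2020-01-02', 1);"
)

# Run each task twice, as a retry or a re-run would.
for _ in range(2):
    append_task.execute(conn)
    replace_task.execute(conn)


def count(day):
    """Return the number of rows stored for the given day."""
    cur = conn.execute("SELECT COUNT(*) FROM events WHERE day = ?", (day,))
    return cur.fetchone()[0]


print(count("2020-01-01"))  # rows written by the append task
print(count("2020-01-02"))  # rows written by the delete-then-insert task
```

After two runs the append task has duplicated its row while the
delete-then-insert task has not, even though the operator code is identical.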

Concerning "what kinds of operators belong in airflow"...  I think it's
clear that idempotence is not a requirement (because in general it cannot
be determined from operator design alone, but depends on
usage).   But are there principles or guidelines that we should try to
adhere to, or evaluate against?  There very well might be.  Or, should we
try to maintain *compatibility* with a certain notion of idempotence, even
if we don't have a well-defined idempotence criterion?  Maybe so.
