I've now updated the document with some "structure" and some
clarifications/context on what we want to achieve by writing it.
I have not yet thought about or elaborated on particular scenarios
in there, but I think having such "context" and "aim" for the document,
plus a general structure, might help with hashing out the details.

Please take a look and let me know what you think.

J


On Fri, Jul 10, 2020 at 2:55 PM Tomasz Urbaszek <[email protected]> wrote:
>
> Thanks Jacob and Daniel for your insights! I created a draft of "Airflow
> operators design guidelines" https://s.apache.org/airflow-operators
>
> I've left some questions that I think should be addressed. Feel free to
> answer, add yours, comment, suggest and edit. I think that once we have
> some general idea of what we as a community expect from a "good operator",
> we can start to think about what is mergeable and what is not.
>
> Tomek
>
> On Fri, Jul 10, 2020 at 3:28 AM Daniel Standish <[email protected]>
> wrote:
>
> > We should be careful not to treat every line in the docs as "constitution"
> > -- i.e. as a commandment.
> >
> > And in the docs, I think we would be better off if we more clearly
> > distinguished (1) the description of what *is* from (2) opinion about
> > what *should be.*
> >
> > *This line should be chopped*
> >
> > Case in point is the line that animates this thread: *"An operator
> > represents a single, ideally idempotent, task."*  (from here
> > <https://airflow.readthedocs.io/en/stable/howto/operator/index.html>)
> >
> > It's just a guess, but I suspect that this line was not meant (or voted on)
> > as a binding rule for the airflow project, but merely meant to serve as a
> > one-sentence answer to the question "what is an operator" on what is
> > essentially a table of contents page.
> >
> > I think we should actually remove the line.
> >
> > Normative bits should be in a normative context, and this kind of content
> > makes more sense in an "operator design patterns" page, or a "best
> > practices" section, where the merits of different patterns and reasoning
> > can be presented.  And to the extent it is meant as a guideline or a rule
> > -- it's too vague to be useful.
> >
> > So I'd propose we chop it and just leave the second line:
> >
> > > See the Operators Concepts
> > > <https://airflow.readthedocs.io/en/stable/concepts.html#concepts-operators>
> > > documentation and the Operators API Reference
> > > <https://airflow.readthedocs.io/en/stable/_api/index.html> for more
> > > information.
> >
> >
> > *Is idempotence ideal?*
> >
> > Incidentally, even in the context of a "best practices page" I'd argue
> > against the claim that *"idempotence is ideal."*  First of all it needs
> > clarification about what it actually means.  But suppose that we accept
> > that it means the canonical execution date pattern.  Some pipelines and
> > tasks lend themselves to this pattern; some do not.  And while it is a good
> > pattern where it works, it's not the only valid design pattern, it isn't
> > the best solution for every data problem, and therefore it doesn't make
> > sense to refer to it as "the ideal pattern".
> >
> > The execution_date-based idempotence pattern has special importance to
> > airflow but I think that in reality the average cluster will have a variety
> > of design patterns -- not all of them using the execution_date idempotence
> > pattern.  And I think we should reflect that reality in our docs and
> > decision-making.
> >
> > *What is a single task?*
> >
> > Notwithstanding the above, regarding Tomek's question about the meaning of
> > "single task", I think in effect what is meant here is just "*discrete*
> > task" or "unit of work" -- a unit of work that can be picked up and
> > executed on a
> > worker.  I don't take it as a claim about *recommended* operator scope --
> > if that's what it is meant to be, it should probably be stated explicitly and
> > in an appropriate context.
> >
> > *What an operator is, vs what is mergeable*
> >
> > On another note, I think it also may be helpful to separate the question
> > "what is an operator" from "what kinds of operators belong in airflow".
> > Indeed this is another area of ambiguity in the quote above -- is it a
> > claim about a best practice for users, as they implement operators for
> > their organization?  Or is it a claim about guidelines when considering
> > whether to merge a new operator into airflow?
> >
> > From the perspective "what is an operator", it is clear to me that an
> > operator (1) is not necessarily idempotent, and (2) has arbitrary scope
> > (i.e. re Tomek's 'what is single task' question).
> >
> >    - idempotence is in general undefined because it depends entirely on how
> >    the user defines the task. (e.g. look at any SqlOperator)
> >    - scope is clearly arbitrary because `execute` can be implemented
> >    arbitrarily.
> >
> > Concerning "what kinds of operators belong in airflow"...  I think it's
> > clear that idempotence is not a requirement (because it's not in general a
> > thing that is determinable based on operator design alone, but depends on
> > usage).   But, are there principles or guidelines that we should try to
> > adhere to, or evaluate against?  There very well might be.  Or, should we
> > try to maintain *compatibility* with a certain notion of idempotence, even
> > if we don't have well-defined idempotence criteria?  Maybe so.
> >



-- 

Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129
