Thanks Jacob and Daniel for your insights! I created a draft of "Airflow operators design guidelines": https://s.apache.org/airflow-operators
I've left some questions that I think should be addressed. Feel free to answer, add your own, comment, suggest, and edit. I think that once we have a general idea of what we as a community expect from a "good operator", we can start to think about what is mergeable and what is not.

Tomek

On Fri, Jul 10, 2020 at 3:28 AM Daniel Standish <[email protected]> wrote:

> We should be careful not to treat every line in the docs as "constitution"
> -- i.e. as a commandment.
>
> And in the docs, I think we would be better off if we more clearly
> distinguished (1) the description of what *is* from (2) opinion about
> what *should be.*
>
> *This line should be chopped*
>
> Case in point is the line that animates this thread: *"An operator
> represents a single, ideally idempotent, task."* (from here
> <https://airflow.readthedocs.io/en/stable/howto/operator/index.html>)
>
> It's just a guess, but I suspect that this line was not meant (or voted on)
> as a binding rule for the airflow project, but merely meant to serve as a
> one-sentence answer to the question "what is an operator" on what is
> essentially a table of contents page.
>
> I think we should actually remove the line.
>
> Normative bits should be in a normative context, and this kind of content
> makes more sense in an "operator design patterns" page, or a "best
> practices" section, where the merits of different patterns and reasoning
> can be presented. And to the extent it is meant as a guideline or a rule
> -- it's too vague to be useful.
>
> So I'd propose we chop it and just leave the second line:
>
> > See the Operators Concepts
> > <https://airflow.readthedocs.io/en/stable/concepts.html#concepts-operators>
> > documentation and the Operators API Reference
> > <https://airflow.readthedocs.io/en/stable/_api/index.html> for more
> > information.
>
> *Is idempotence ideal?*
>
> Incidentally, even in the context of a "best practices page" I'd argue
> against the claim that *"idempotence is ideal."* First of all it needs
> clarification about what it actually means. But suppose that we accept
> that it means the canonical execution date pattern. Some pipelines and
> tasks lend themselves to this pattern; some do not. And while it is a good
> pattern where it works, it's not the only valid design pattern, it isn't
> the best solution for every data problem, and therefore it doesn't make
> sense to refer to it as "the ideal pattern".
>
> The execution_date-based idempotence pattern has special importance to
> airflow but I think that in reality the average cluster will have a variety
> of design patterns -- not all of them using the execution_date idempotence
> pattern. And I think we should reflect that reality in our docs and
> decision-making.
>
> *What is a single task?*
>
> Notwithstanding the above, regarding Tomek's question about the meaning of
> "single task", I think in effect what is meant here is just "*discrete*
> task" or "unit of work" -- a unit of work that can be picked up and
> executed on a worker. I don't take it as a claim about *recommended*
> operator scope -- if that's what it is meant to be, it should probably be
> made explicitly and in an appropriate context.
>
> *What an operator is, vs what is mergeable*
>
> On another note, I think it also may be helpful to separate the question
> "what is an operator" from "what kinds of operators belong in airflow".
> Indeed this is another area of ambiguity in the quote above -- is it a
> claim about a best practice for users, as they implement operators for
> their organization? Or is it a claim about guidelines when considering
> whether to merge a new operator into airflow?
>
> From the perspective "what is an operator", it is clear to me that an
> operator is (1) not necessarily idempotent, and (2) has arbitrary scope
> (i.e. re Tomek's 'what is single task' question).
>
> - idempotence is in general undefined because it depends entirely on how
> the user defines the task. (e.g. look at any SqlOperator)
> - scope is clearly arbitrary because `execute` can be implemented
> arbitrarily.
>
> Concerning "what kinds of operators belong in airflow"... I think it's
> clear that idempotence is not a requirement (because it's not in general a
> thing that is determinable based on operator design alone, but depends on
> usage). But, are there principles or guidelines that we should try to
> adhere to, or evaluate against? There very well might be. Or, should we
> try to maintain *compatibility* with a certain notion of idempotence, even
> if we don't have a well-defined idempotence criterion? Maybe so.
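
To make the SqlOperator point above concrete, here is a minimal sketch of how the same operator code can be idempotent or not depending purely on the SQL the user hands it. It deliberately does not use Airflow: `SqlTask`, the `daily_counts` table, and `rows_after_two_runs` are hypothetical names invented for illustration, and sqlite3 stands in for whatever database a real task would target. The "overwrite" variant is one plausible reading of the execution-date pattern (delete the date's partition, then rewrite it), not the only one.

```python
import sqlite3


class SqlTask:
    """Hypothetical stand-in for a generic SQL operator (not Airflow's API)."""

    def __init__(self, statements):
        self.statements = statements

    def execute(self, conn, execution_date):
        # Every statement receives the run's execution date as :ds.
        for sql in self.statements:
            conn.execute(sql, {"ds": execution_date})
        conn.commit()


# Same class, different user-supplied SQL.
append_task = SqlTask(["INSERT INTO daily_counts VALUES (:ds, 1)"])
overwrite_task = SqlTask([
    "DELETE FROM daily_counts WHERE ds = :ds",   # clear this date's partition
    "INSERT INTO daily_counts VALUES (:ds, 1)",  # then rewrite it
])


def rows_after_two_runs(task, execution_date="2020-07-10"):
    """Run the task twice for one execution date and count the resulting rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_counts (ds TEXT, n INTEGER)")
    task.execute(conn, execution_date)
    task.execute(conn, execution_date)  # simulate a retry or a re-run
    (count,) = conn.execute("SELECT COUNT(*) FROM daily_counts").fetchone()
    conn.close()
    return count


print(rows_after_two_runs(append_task))     # 2: the re-run duplicated the row
print(rows_after_two_runs(overwrite_task))  # 1: the re-run left the same state
```

Nothing in `SqlTask.execute` determines which behaviour you get, which is the sense in which idempotence depends on usage rather than on operator design.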
