I have now updated the document with some "structure" and some clarifications/context on what we want to achieve by writing it. I have not yet elaborated on particular scenarios there, but I think that having this "context" and "aim" for the document, plus a general structure, might help with hashing out the details.
Please take a look and let me know what you think.

J

On Fri, Jul 10, 2020 at 2:55 PM Tomasz Urbaszek <[email protected]> wrote:
>
> Thanks Jacob and Daniel for your insights! I created a draft of "Airflow
> operators design guidelines" https://s.apache.org/airflow-operators
>
> I've left some questions that I think should be addressed. Feel free to
> answer, add yours, comment, suggest and edit. I think that once we have
> some general idea what we as a community expect from a "good operator" we
> can start to think what is mergable or not.
>
> Tomek
>
> On Fri, Jul 10, 2020 at 3:28 AM Daniel Standish <[email protected]>
> wrote:
>
> > We should be careful not to treat every line in the docs as "constitution"
> > -- i.e. as a commandment.
> >
> > And in the docs, I think we would be better off if we more clearly
> > distinguished (1) the description of what *is* from (2) opinion about
> > what *should be.*
> >
> > *This line should be chopped*
> >
> > Case in point is the line that animates this thread: *"An operator
> > represents a single, ideally idempotent, task."* (from here
> > <https://airflow.readthedocs.io/en/stable/howto/operator/index.html>)
> >
> > It's just a guess, but I suspect that this line was not meant (or voted on)
> > as a binding rule for the airflow project, but merely meant to serve as a
> > one-sentence answer to the question "what is an operator" on what is
> > essentially a table of contents page.
> >
> > I think we should actually remove the line.
> >
> > Normative bits should be in a normative context, and this kind of content
> > makes more sense in an "operator design patterns" page, or a "best
> > practices" section, where the merits of different patterns and reasoning
> > can be presented. And to the extent it is meant as a guideline or a rule
> > -- it's too vague to be useful.
> >
> > So I'd propose we chop it and just leave the second line:
> >
> > > See the Operators Concepts
> > > <https://airflow.readthedocs.io/en/stable/concepts.html#concepts-operators>
> > > documentation and the Operators API Reference
> > > <https://airflow.readthedocs.io/en/stable/_api/index.html> for more
> > > information.
> >
> > *Is idempotence ideal?*
> >
> > Incidentally, even in the context of a "best practices page" I'd argue
> > against the claim that *"idempotence is ideal."* First of all it needs
> > clarification about what it actually means. But suppose that we accept
> > that it means the canonical execution date pattern. Some pipelines and
> > tasks lend themselves to this pattern; some do not. And while it is a good
> > pattern where it works, it's not the only valid design pattern, it isn't
> > the best solution for every data problem, and therefore it doesn't make
> > sense to refer to it as "the ideal pattern".
> >
> > The execution_date-based idempotence pattern has special importance to
> > airflow but I think that in reality the average cluster will have a variety
> > of design patterns -- not all of them using the execution_date idempotence
> > pattern. And I think we should reflect that reality in our docs and
> > decision-making.
> >
> > *What is a single task?*
> >
> > Notwithstanding the above, regarding Tomek's question about the meaning of
> > "single task", I think in effect what is meant here is just "*discrete*
> > task" or "unit of work" -- a unit of work that can be picked up and
> > executed on a worker. I don't take it as a claim about *recommended*
> > operator scope -- if that's what it is meant to be, it should probably be
> > made explicitly and in an appropriate context.
> >
> > *What an operator is, vs what is mergable*
> >
> > On another note, I think it also may be helpful to separate the question
> > "what is an operator" from "what kinds of operators belong in airflow".
> > Indeed this is another area of ambiguity in the quote above -- is it a
> > claim about a best practice for users, as they implement operators for
> > their organization? Or is it a claim about guidelines when considering
> > whether to merge a new operator into airflow?
> >
> > From the perspective "what is an operator", it is clear to me that an
> > operator is (1) not necessarily idempotent, and (2) has arbitrary scope
> > (i.e. re Tomek's 'what is single task' question).
> >
> > - idempotence is in general undefined because it depends entirely on how
> >   the user defines the task. (e.g. look at any SqlOperator)
> > - scope is clearly arbitrary because `execute` can be implemented
> >   arbitrarily.
> >
> > Concerning "what kinds of operators belong in airflow"... I think it's
> > clear that idempotence is not a requirement (because it's not in general a
> > thing that is determinable based on operator design alone, but depends on
> > usage). But, are there principles or guidelines that we should try to
> > adhere to, or evaluate against? There very well might be. Or, should we
> > try to maintain *compatibility* with a certain notion of idempotence, even
> > if we don't have a well-defined idempotence criteria? Maybe so
> >

--
Jarek Potiuk
Polidea | Principal Software Engineer
M: +48 660 796 129
