Thanks Daniel - really helpful!
I understand the "statefulness" case - it does allow "reuse" of hooks.
And while it is not a "must" - that's a noble goal on its own. :)
Now - for the triggers - I would love to understand more, and what you
wrote let me see where the problem lies in detail :).
Let me explain my understanding in my own words (because I think it's
not at all obvious for those who have never written a Deferrable
Operator nor followed the AIP/specification):
* currently, when we defer an operator, the Operator object itself is
not serialized - we simply tell Airflow: "when you resume the
operator, create it the same way DAG parsing does, and call 'this'
(chosen) method with 'those' args provided"
* this means that we should not rely on any state stored in the
operator. We can only rely on whatever the "__init__" method of the
operator creates as "self" attributes (and those should be basically
immutable - we should not rely on any state of those, as they will
always be recreated when resuming)
* so, effectively there are two ways we can make Hooks
"deferrable-operator-friendly":
1) We create the Hook in __init__ as a "self." attribute - but then
the Hook should be stateless and should not access the metadata DB
(to retrieve a Connection, for example) in its __init__. This is the
"Databricks" Hook approach
2) Or we always recreate the hook in every "resumable" method
* the same limitations apply to any "self." attribute in any deferrable operator
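To make the two options concrete, here is a minimal, Airflow-free
sketch - the classes (FakeHook, OperatorApproach1, OperatorApproach2)
are hypothetical stand-ins I made up for illustration, not real
Airflow APIs:

```python
class FakeHook:
    """Approach-1-friendly hook: stateless, no DB access in __init__."""

    def __init__(self, conn_id):
        self.conn_id = conn_id  # only store config; fetch the Connection lazily

    def get_conn(self):
        # Real work (metadata-DB lookup, client creation) happens here,
        # not in __init__, so the hook is safe to build at parse time.
        return f"connection-for-{self.conn_id}"


class OperatorApproach1:
    """Hook created in __init__; OK only because FakeHook.__init__ is cheap."""

    def __init__(self, conn_id):
        self.hook = FakeHook(conn_id)

    def execute(self):
        return self.hook.get_conn()


class OperatorApproach2:
    """Hook recreated inside every 'resumable' method."""

    def __init__(self, conn_id):
        self.conn_id = conn_id  # plain, immutable config only

    def execute(self):
        hook = FakeHook(self.conn_id)  # rebuilt on this entrypoint
        return hook.get_conn()

    def execute_complete(self):
        hook = FakeHook(self.conn_id)  # rebuilt again after resuming
        return hook.get_conn()
```

Both behave the same from the outside; the difference is only where
the (cheap) hook construction happens.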
Is that the right assessment?
If so - that potentially changes the current (not codified, but used
all over the code) way of treating Hooks: they are commonly
instantiated in the "execute()" method.
Just to give an example: we have GoogleBaseHook, which does this in __init__:
self.extras = self.get_connection(self.gcp_conn_id).extra_dejson
This basically means that we cannot create the Hook in the operator's
__init__ (it accesses the database). So none of the Google Hooks are
compatible with approach 1).
Currently, all "execute()" methods in the Google Operators look similar to:
def execute(self, context) -> None:
    hook = GCSHook(
        gcp_conn_id=self.gcp_conn_id,
        impersonation_chain=self.impersonation_chain,
    )
And as a consequence - if we want to make any Google Operator
deferrable, we have to make sure the Hook is recreated as the first
step in every method that can be used as a "resume" entrypoint.
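In other words, something along these lines - a structural sketch
only, using a stand-in hook class so the snippet is self-contained; in
real code GCSHook and the operator base class would come from Airflow,
and "execute_complete" is merely the conventional name for the resume
entrypoint:

```python
class StubGCSHook:
    """Stand-in for GCSHook so the sketch runs without Airflow."""

    def __init__(self, gcp_conn_id, impersonation_chain=None):
        self.gcp_conn_id = gcp_conn_id
        self.impersonation_chain = impersonation_chain


class GCSDeferrableOperatorSketch:
    def __init__(self, gcp_conn_id, impersonation_chain=None):
        # Only plain, immutable config here - nothing DB-backed,
        # because __init__ runs again from scratch when the task resumes.
        self.gcp_conn_id = gcp_conn_id
        self.impersonation_chain = impersonation_chain

    def _get_hook(self):
        # Single place that rebuilds the hook for every entrypoint.
        return StubGCSHook(
            gcp_conn_id=self.gcp_conn_id,
            impersonation_chain=self.impersonation_chain,
        )

    def execute(self, context):
        hook = self._get_hook()  # first entrypoint
        return hook

    def execute_complete(self, context, event=None):
        hook = self._get_hook()  # resume entrypoint: recreate, never reuse
        return hook
```

The "_get_hook" helper keeps the "recreate as the first step" rule in
one place instead of repeating the constructor in every method.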
Did I get it right?
If so - I wholeheartedly agree serializing Hooks is NOT something we want to do.
But I think we need to agree on, codify (and maybe re-implement in
some places) how to follow those (new) limitations and "best
practices" for all Hooks - even the ones not used in a "deferrable"
context - because we likely want more of our Operators to become
deferrable as soon as possible.
So, I have a few questions to which I think we do not have answers yet:
* Should we accept both approaches 1) and 2) as "good practice"? Or
should we prefer approach 1)? Approach 2) has the annoying "you have
to remember to initialize hooks in every resumable method" property -
which is not nice and, I think, difficult to debug.
* Should we write some "Hook writing best practices"
https://airflow.apache.org/docs/apache-airflow/stable/howto/custom-operator.html?highlight=hooks#hooks
- explaining why and how Hooks should behave (with a reference to
deferrable operators to explain why)? I am happy to propose something
once we agree on its scope.
* Should we try to detect and flag (during our tests) the Hooks that
are doing things wrongly (in this new context, of course)? We could do
it as part of AIP-44
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API?src=contextnavpagetreemode
where a similar test harness will be needed anyway (and is proposed).
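For what it's worth, one pattern that could soften the "remember to
initialize in every resumable method" pain from the first question -
this is only a suggestion from my side, sketched here with stub
classes rather than real Airflow ones - is exposing the hook as a
functools.cached_property. Each freshly created operator instance
(including the one recreated on resume) lazily builds its own hook on
first access, while __init__ stays DB-free:

```python
from functools import cached_property


class StubHook:
    """Stand-in hook; imagine any expensive lookup happening lazily."""

    def __init__(self, conn_id):
        self.conn_id = conn_id


class OperatorWithLazyHook:
    def __init__(self, conn_id):
        self.conn_id = conn_id  # plain config only, safe at parse time

    @cached_property
    def hook(self):
        # Built on first use within *this* instance; a resumed task gets
        # a brand-new operator instance, hence a brand-new hook.
        return StubHook(self.conn_id)

    def execute(self, context):
        return self.hook.conn_id

    def execute_complete(self, context, event=None):
        return self.hook.conn_id  # no explicit re-initialization needed
```

Within one instance the hook is created once and reused; across
deferral boundaries it is naturally recreated, so no resumable method
has to remember to call a constructor.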
J
On Tue, Dec 28, 2021 at 8:20 PM Daniel Standish
<[email protected]> wrote:
>
> "Statefulness" of hooks
>
> One part of this discussion was this issue I created concerning the snowflake
> hook. My point there is that we should not have side effects and
> "statefulness" in a hook without good reason.
>
> In that case when running a query, in addition to returning the query ids, we
> store them on the instance attribute `query_ids`. This could certainly make
> sense to do in an operator e.g. to facilitate behavior of on_kill. But in a
> hook in general I don't think there's a good reason, for one because hooks
> aren't run themselves, but always within a task -- either a python function
> or an operator, in which state can be managed.
>
> I'm aware of another case, namely SubprocessHook, where there is some
> "statefulness". In that hook we store the subprocess object as a side
> effect in `run_command`. This was sort of necessary because the attribute is
> only known once inside a context manager, and storing it allows us to refer
> to it in on_kill. But to avoid this "statefulness" in the hook we could have
> returned a runner object that did essentially the same thing as what the hook
> now does.
>
> The general principle motivating my concern here is (1) to avoid side effects
> generally (least surprise) and (2) that having this kind of statefulness of
> the hook has the potential to unnecessarily limit its usage, e.g. running
> multiple queries concurrently or running multiple subprocesses or whatever it is.
> My concern isn't so much that there's a specific issue with regard to
> deferrable operators, but that a hook should be permissive in the sense that
> it can be reused for multiple operations where it makes sense and not do side
> effects without good reason.
>
> Hook params with triggers
>
> Separately though, deferrable operators _do_ present an interesting challenge
> to the way we manage hooks. Commonly with hooks we like to allow the user
> options for configuration, e.g. you can let the airflow Connection object
> completely configure the hook, _or_ you can override some aspects of the hook
> behavior with hook init params.
>
> E.g. maybe the connection says database=hello and in hook you override that
> with database=world.
>
> In hooks where we have a lot of init params this has the potential to create
> messy Trigger serialization if e.g. we have to include in the Trigger init
> every hook param. And maybe this is what you were alluding to Jarek with
> "non-serializable" fields of hooks? The implication seems to be that we
> would perhaps "serialize" the hook and attach that to the trigger. It's an
> interesting idea. Another approach here is perhaps to adopt the convention
> that Triggers should have for example `conn_id` and `hook_params: dict = None`
> rather than inlining all the hook params (which would not be very
> future-proof i.e. would require changes to trigger serialization whenever
> hook params change).
>
>