kacpermuda commented on PR #63499:
URL: https://github.com/apache/airflow/pull/63499#issuecomment-4068688463
Thanks for working on this - the use case makes sense, and having a way to
limit hook-level lineage emission can definitely be useful.
One thing I’d like to raise before we settle on the exact shape of the
solution is that hook-level lineage is fundamentally a core feature, but the
places where we will likely want to control it live in multiple providers (S3,
GCS, etc.). Because of that, it would be good to think a bit about how to keep
it consistent across providers without introducing a dependency on newer core
versions.
In particular:
1. Global vs local control
It probably makes sense to have a cluster-level/default configuration,
and then allow providers or hooks to override it when needed. That way DAG
authors do not need to explicitly configure every hook if a cluster operator
wants to disable or limit hook-level lineage globally. Since AF 3.2 we can
already effectively disable hook-level lineage by setting the asset
collection limit to 0 (PR #62010), but maybe we should consider introducing a
more explicit config (for example a simple boolean to enable/disable hook-level
lineage), and then allow hooks/providers to override it if needed.
2. Consistency across providers without a core dependency
This is a core feature that we will likely want to control in multiple
providers, so we probably want to keep naming and semantics consistent. At the
same time, we would like to do this without introducing a dependency on core,
so that providers can implement it while still supporting multiple Airflow
versions (basically so that what we introduce today still works on AF 2.11).
It might be good to discuss this with the more task-sdk oriented
contributors as well, as it feels more like an API boundary question between
core and providers. One possible idea could be to keep an internal _BaseHook in
core, while also exposing a BaseHook in a common provider layer, so that
_BaseHook remains an internal implementation detail and BaseHook can be used
more freely by providers. cc @ashb, curious what you think.
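To make point 1 a bit more concrete, here is a rough sketch of the "global default, hook-level override" resolution. Everything in it is hypothetical: the `EXAMPLE_HOOK_LINEAGE_ENABLED` variable, the `hook_lineage_enabled` argument, and `ExampleHook` are illustrative names only, not existing Airflow options or classes. It deliberately avoids importing anything from core, in the spirit of point 2.

```python
from __future__ import annotations

import os

# Hypothetical environment variable standing in for a cluster-level setting;
# this is an assumption for illustration, not an existing Airflow option.
GLOBAL_ENV_VAR = "EXAMPLE_HOOK_LINEAGE_ENABLED"


def global_hook_lineage_enabled(default: bool = True) -> bool:
    """Read the cluster-wide default; fall back to `default` when unset."""
    raw = os.environ.get(GLOBAL_ENV_VAR)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def resolve_hook_lineage_enabled(hook_override: bool | None = None) -> bool:
    """An explicit hook-level value wins; None defers to the global default."""
    if hook_override is not None:
        return hook_override
    return global_hook_lineage_enabled()


class ExampleHook:
    """Hypothetical provider hook; it does not subclass any Airflow class."""

    def __init__(self, hook_lineage_enabled: bool | None = None) -> None:
        # Resolved once at construction time so the hook behaves consistently.
        self.hook_lineage_enabled = resolve_hook_lineage_enabled(hook_lineage_enabled)
```

With this shape, a cluster operator flips one setting and DAG authors only pass `hook_lineage_enabled=...` when a specific hook needs to differ. Whether the global value should live in core config, a common provider layer, or somewhere else is exactly the open question.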
Given the above, it might be worth having a short discussion (either here or
on the devlist) about the intended approach. Maybe we end up doing exactly what
this PR proposes, but it would be good to confirm that this is the direction we
want and that we apply it consistently across providers.
Curious what others think. cc @eladkal @potiuk @mobuchowski