kacpermuda commented on PR #63499:
URL: https://github.com/apache/airflow/pull/63499#issuecomment-4068688463

   Thanks for working on this - the use case makes sense, and having a way to 
limit hook-level lineage emission can definitely be useful.
   
   One thing I’d like to raise before we settle on the exact shape of the 
solution is that hook-level lineage is fundamentally a core feature, but the 
places where we will likely want to control it live in multiple providers (S3, 
GCS, etc.). Because of that, it would be good to think a bit about how to keep 
it consistent across providers without introducing a dependency on newer core 
versions.
   
   In particular:
   
   1. Global vs local control  
      It probably makes sense to have a cluster-level/default configuration, 
and then allow providers or hooks to override it when needed. That way DAG 
authors do not need to explicitly configure every hook if a cluster operator 
wants to disable or limit hook-level lineage globally. Since AF 3.2 we can 
already effectively disable hook-level lineage (HLL) by setting the asset 
collection limit to 0 (PR #62010), but maybe we should consider introducing a 
more explicit config (for example, a simple boolean to enable/disable HLL), and 
then allow hooks/providers to override it if needed.
   
   2. Consistency across providers without a core dependency  
      This is a core feature that we will likely want to control in multiple 
providers, so we probably want to keep naming and semantics consistent. At the 
same time, we would like to do this without introducing a dependency on core, 
so that providers can implement it while still supporting multiple Airflow 
versions (basically, so that what we introduce today still works on AF 2.11).  
      It might be good to discuss this with the more task-sdk oriented 
contributors as well, as it feels more like an API boundary question between 
core and providers. One possible idea could be to keep an internal _BaseHook in 
core, while also exposing a BaseHook in a common provider layer, so that 
_BaseHook remains an internal implementation detail and BaseHook can be used 
more freely by providers. cc @ashb , curious what you think.
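
To make the "global default, local override" idea from point 1 concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and only for illustration: `resolve_hll_enabled`, `ExampleHook`, and the `emit_lineage` argument are not existing Airflow names, and where the cluster default is read from is deliberately left out. It uses no Airflow imports, which is roughly the property point 2 asks for.

```python
from __future__ import annotations

from typing import Optional


def resolve_hll_enabled(cluster_default: bool, hook_override: Optional[bool] = None) -> bool:
    """Return the effective hook-level lineage setting.

    An explicit hook-level value wins; None means "not set, use the cluster default".
    """
    if hook_override is not None:
        return hook_override
    return cluster_default


class ExampleHook:
    """Hypothetical hook used only for illustration; not an Airflow class."""

    def __init__(self, emit_lineage: Optional[bool] = None) -> None:
        # None means the DAG author did not configure anything on this hook.
        self.emit_lineage = emit_lineage

    def should_emit_lineage(self, cluster_default: bool) -> bool:
        return resolve_hll_enabled(cluster_default, self.emit_lineage)
```

With this shape, a cluster operator can switch HLL off globally and DAG authors only touch a hook when they need to deviate from that default.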
   
   Because of that, it might be worth having a short discussion (either here or 
on the devlist) about the intended approach. Maybe we end up doing exactly what 
this PR proposes, but it would be good to confirm that this is the direction we 
want and that we apply it consistently across providers. 
   
   Curious what others think. cc @eladkal @potiuk @mobuchowski 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

