mobuchowski commented on PR #31816:
URL: https://github.com/apache/airflow/pull/31816#issuecomment-1587419187

   @vandonr-amz this is meant to be used in [OpenLineage integration](https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow) - in video form https://www.youtube.com/watch?v=fAqvoMzz7Tk 🙂
   
   In short, the idea is that operators track their own lineage; the
captured lineage is then sent in a common format to the configured backend.
   To do that, we add methods that will be called by the OpenLineage plugin, if
it's installed or configured. They are not going to be called from user code,
whether that's a DAG or anything else.
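
A minimal sketch of that call pattern, in plain Python with no Airflow imports. All names here (`ProcessingOperator`, `get_lineage`, `collect_lineage`, `Lineage`) are hypothetical stand-ins for illustration, not the actual method names or classes from the PR:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Lineage:
    """Hypothetical container for the datasets an operator read and wrote."""
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


class ProcessingOperator:
    """Stand-in for an operator; not a real Airflow class."""

    def __init__(self, job_name: str) -> None:
        self.job_name = job_name
        self.processing_job: Optional[dict] = None

    def execute(self, context: dict) -> None:
        # A real operator would start the job and keep the service response.
        self.processing_job = {
            "inputs": [f"s3://in/{self.job_name}"],
            "outputs": [f"s3://out/{self.job_name}"],
        }

    def get_lineage(self) -> Lineage:
        # Hypothetical hook name; only the plugin is expected to call it.
        job = self.processing_job or {}
        return Lineage(job.get("inputs", []), job.get("outputs", []))


def collect_lineage(operator: Any) -> Optional[Lineage]:
    """Plugin side: call the hook only if the operator defines it."""
    hook = getattr(operator, "get_lineage", None)
    return hook() if callable(hook) else None
```

The plugin needs no knowledge of the operator beyond the hook's name, and operators without the hook are simply skipped.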
   
   >Also, rather than making this a method that should only be called after 
execute has set self.processing_job, which is a bit brittle, why not make the 
dependency explicit by expecting the serialized job as a parameter ? Then it 
wouldn't have to be stored in the class at all.
   >This could also be a class method, the only usage of self after that is the
log.
   
   In the integration, we can't depend on knowing what parameters a particular
operator requires. That would mean something external has to know what the
method expects, and follow up whenever it changes. This is much more brittle
because it splits responsibility between multiple separately versioned
components; here the operator can always report its own lineage, and we can
check that with tests. If the contract changes, the tests fail.
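
To illustrate the brittleness, here is a small sketch under assumed names: `OperatorV1`, `OperatorV2`, `get_lineage` and `external_extract` are all hypothetical, and the "rename" is an invented example of an internal change between releases:

```python
class OperatorV1:
    """Hypothetical operator that stores the job response on the instance."""

    def __init__(self) -> None:
        self.processing_job = {"outputs": ["s3://bucket/out"]}

    def get_lineage(self) -> list:
        # In-operator reporting: lives next to the field it reads.
        return self.processing_job["outputs"]


class OperatorV2:
    """The same operator after an internal rename in a later release."""

    def __init__(self) -> None:
        self.job_response = {"outputs": ["s3://bucket/out"]}

    def get_lineage(self) -> list:
        # Updated in the same change as the rename, and covered by the
        # operator's own tests.
        return self.job_response["outputs"]


def external_extract(operator) -> list:
    """Separately versioned extractor that reaches into operator internals."""
    return operator.processing_job["outputs"]


def extractor_still_works(operator) -> bool:
    """Report whether the external extractor survives this operator version."""
    try:
        external_extract(operator)
        return True
    except AttributeError:
        return False
```

With these definitions the external extractor works against `OperatorV1` but raises `AttributeError` against `OperatorV2`, while `get_lineage()` keeps working on both because it changes together with the operator.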
   
   That was our previous approach, and we moved to the in-operator approach for
this reason:
https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/sagemaker_extractors.py


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
