mobuchowski commented on PR #31816: URL: https://github.com/apache/airflow/pull/31816#issuecomment-1587419187
@vandonr-amz this is meant to be used in the [OpenLineage integration](https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-53+OpenLineage+in+Airflow) (in video form: https://www.youtube.com/watch?v=fAqvoMzz7Tk 🙂).

In short, the idea is that operators track their own lineage; the captured lineage is then sent in a common format to the configured backend. To do that, we add methods that will be called by the OpenLineage plugin, if it's installed or configured. They are not going to be called from user code, whether that's a DAG or anything else.

> Also, rather than making this a method that should only be called after execute has set self.processing_job, which is a bit brittle, why not make the dependency explicit by expecting the serialized job as a parameter? Then it wouldn't have to be stored in the class at all. This could also be a class method, the only usage of self after that is the log.

In the integration, we can't depend on knowing what parameters a particular operator requires. That would mean something external had to know what the method expects, and follow up whenever it changes. This is much more brittle because it splits responsibility between multiple separately versioned components; here the operator can always report its own lineage, and we can check that with tests. If the contract changes, the tests fail. The external approach was our previous one, and we moved to the in-operator approach for this reason: https://github.com/OpenLineage/OpenLineage/blob/main/integration/airflow/openlineage/airflow/extractors/sagemaker_extractors.py

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
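To illustrate the in-operator approach argued for in the comment above, here is a minimal sketch in plain Python. It does not use Airflow or OpenLineage; `Lineage`, `ProcessingOperator`, `PlainOperator`, `get_lineage`, and `collect_lineage` are all hypothetical names chosen for illustration, not the API from the PR. The point it tries to show is that the caller needs no knowledge of operator-specific parameters: it only checks whether the hook exists and calls it with no arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Lineage:
    """Hypothetical container for the datasets a task read and wrote."""
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


class ProcessingOperator:
    """Stand-in for an operator that knows how to report its own lineage."""

    def __init__(self, source: str, destination: str) -> None:
        self.source = source
        self.destination = destination
        self.processing_job: Optional[dict[str, Any]] = None

    def execute(self) -> None:
        # The operator keeps whatever state its lineage hook will need later.
        self.processing_job = {
            "inputs": [self.source],
            "outputs": [self.destination],
        }

    def get_lineage(self) -> Lineage:
        # Hypothetical hook: no operator-specific arguments, so the caller
        # does not have to track this operator's internals across versions.
        job = self.processing_job or {}
        return Lineage(
            inputs=list(job.get("inputs", [])),
            outputs=list(job.get("outputs", [])),
        )


class PlainOperator:
    """Stand-in for an operator that does not implement the hook."""

    def execute(self) -> None:
        pass


def collect_lineage(operator: Any) -> Optional[Lineage]:
    """What a lineage plugin might do: call the hook only if it exists."""
    hook = getattr(operator, "get_lineage", None)
    if not callable(hook):
        return None
    return hook()
```

Because the hook lives next to `execute`, a unit test on the operator alone can catch a broken contract, for example by asserting that `collect_lineage(op)` returns the expected inputs and outputs after `op.execute()`. An external extractor, by contrast, would have to be updated in a separately versioned package.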
