potiuk commented on a change in pull request #9850:
URL: https://github.com/apache/airflow/pull/9850#discussion_r455781724
##########
File path: airflow/models/serialized_dag.py
##########
@@ -94,13 +97,26 @@ def write_dag(cls, dag: DAG, min_update_interval:
Optional[int] = None, session=
(timezone.utcnow() -
timedelta(seconds=min_update_interval)) < cls.last_updated))
).scalar():
return
- log.debug("Writing DAG: %s to the DB", dag.dag_id)
- session.merge(cls(dag))
+
+ log.debug("Checking if DAG (%s) is updated or same", dag.dag_id)
+ serialized_dag_from_db: SerializedDagModel = (
+ session
+ .query(cls)
+ .filter(cls.dag_id == dag.dag_id)
+ .one_or_none()
+ )
Review comment:
In the AirBnB talk they mentioned really big DAGs and the need to use
compression in the DAG serialization DB (they implemented it in their fork). I
think adding one extra read here might add quite heavy DB throughput. It's an
edge case, but possibly an important one, and some companies are already
experiencing it.
I know it would be a somewhat bigger change (involving a DB migration), but
possibly we should calculate a hash of the DAG, store it, and compare the hash
read from the DB rather than the DAG itself. That seems perfectly doable and
not that complex.
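A minimal sketch of the idea, assuming a hypothetical `dag_hash` column stored alongside the serialized DAG (not the actual Airflow implementation; names are illustrative):

```python
# Hedged sketch: compare a small fingerprint of the serialized DAG against
# a stored hash instead of reading back the full serialized blob.
# `dag_fingerprint` and `needs_update` are hypothetical helpers.
import hashlib
import json


def dag_fingerprint(serialized_dag: dict) -> str:
    """Stable hash of a serialized DAG, usable as a cheap change detector."""
    # sort_keys makes the JSON encoding deterministic regardless of dict order
    payload = json.dumps(serialized_dag, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def needs_update(stored_hash: str, serialized_dag: dict) -> bool:
    """True if the DAG changed since the stored hash was written."""
    return stored_hash != dag_fingerprint(serialized_dag)
```

With this, `write_dag` would only need to select the (short) hash column for the comparison, and would write the new serialized DAG plus its hash only when `needs_update` returns True.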
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]