ashb commented on a change in pull request #9851:
URL: https://github.com/apache/airflow/pull/9851#discussion_r456323234



##########
File path: airflow/models/dagbag.py
##########
@@ -151,10 +151,10 @@ def get_dag(self, dag_id):
                 # Load from DB if not (yet) in the bag
                 self._add_dag_from_db(dag_id=dag_id)
 
-            min_serialized_dag_update_secs = 
timedelta(seconds=settings.MIN_SERIALIZED_DAG_UPDATE_INTERVAL)
+            min_serialized_dag_fetch_secs = 
timedelta(seconds=settings.MIN_SERIALIZED_DAG_FETCH_INTERVAL)
             if (
                 dag_id in self.dags_last_changed and
-                timezone.utcnow() > self.dags_last_changed[dag_id] + 
min_serialized_dag_update_secs

Review comment:
       This logic is wrong now.
   
   We want to only go to the DB every (say) 10 seconds to check if the dag has 
updated, but this is making that decision on last_update_time.
   
   Take for example:
   
   - A dag is last updated 2 minutes ago.
   - Every time we ask for that dag, it would see that the last update is more 
than 10 seconds ago and go check the DB
   - It would then set dags_last_changed to the same 2 minute ago.
   - If the dag is then asked for again one second later, we would still find 
it was updated 2m1s ago, even though we went to the DB 1s ago. 
   
   Instead we need to store a "dags_last_fetched" which is set to `now()` 
inside _add_dag_from_db, and that is what we  make this decision on. i.e. to 
check if we should even ask the database we need a different timestamp. We 
could then further optimize the DB path, and make two queries separate:
   
   - one to get the last_updated_time.
   - Only ask for the (large) `data` column if the last_updated_time is 
different.
   
   This does mean two queries per dag to fetch when it has updated, but when it 
_hasn't_ updated (which will be the much much more common case) we don't have 
to load the (possibly huge) `data` JSON column.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to