potiuk commented on code in PR #40916:
URL: https://github.com/apache/airflow/pull/40916#discussion_r1688513049


##########
airflow/dag_processing/processor.py:
##########
@@ -837,13 +877,17 @@ def process_file(
         :return: number of dags found, count of import errors, last number of 
db queries
         """
         self.log.info("Processing file %s for tasks to queue", file_path)
+        try:
+            if InternalApiConfig.get_use_internal_api():
+                dagbag = DagFileProcessor._get_dagbag(file_path)
+            else:
+                with create_session() as session:

Review Comment:
   All right @jscheffl and @vincbeck. I think I have it right - I will do some 
more manual testing and see if it works in various combinations. Looking for 
your comments.
   
   But the thing I found is that the counter was LIKELY not doing what it was 
supposed to do when it was added in https://github.com/apache/airflow/pull/40323 
   
   @MaksYermak @kosteev @VladaZakharova -> this is probably the last "serious" 
change needed to make AIP-44 work (special tests and docs will still be needed, 
but those can be added after this). Let me know if I understood it properly.
   
   When I looked at the counter from #40323, the event was added to the session 
just before creating the dagbag from the file - apparently with the intention of 
counting the DB queries done inside dagbag parsing. But as far as I understand 
(please correct me if I am wrong), that would not work: the session the event 
was registered for is not passed down to, or used in, dagbag parsing. Any DB 
access there creates a new session rather than reusing the session created in 
"process_file", so none of those queries would be counted. It was, however, 
counting queries potentially run during callbacks, warning updates, etc.
   
   I **THINK** that was not the intention of that change - the intention was to 
count queries resulting from parsing the DAG (which it does not do). I am not 
sure we can easily fix it, but it does not matter in DB isolation mode at least, 
because no DB queries are allowed while parsing the DAG anyway.
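   To illustrate the point, here is a minimal sketch (not the Airflow code; the 
session names are mine): a listener registered on one SQLAlchemy session never 
fires for queries issued through a different session.

   ```python
   from sqlalchemy import create_engine, event, select
   from sqlalchemy.orm import Session

   engine = create_engine("sqlite://")
   outer = Session(engine)  # stands in for the session created in "process_file"
   calls = []

   # A listener attached to *outer* only observes ORM executions on that session.
   @event.listens_for(outer, "do_orm_execute")
   def _count(ctx):
       calls.append(ctx)

   # A fresh session (like the ones opened during dagbag parsing) is unobserved.
   inner = Session(engine)
   inner.execute(select(1))
   assert len(calls) == 0

   # Only queries going through the observed session are counted.
   outer.execute(select(1))
   assert len(calls) == 1
   ```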
   
   In this change I attempted to retain that behaviour. Because we cannot pass a 
parsed dag to the Internal API component, the whole "process_file" is now 
effectively split into more transactions - 4 instead of the original 2. I think 
that is OK, since transactional integrity is not important in those cases. I 
collect the query counts from all 4 calls that might turn into an 
"internal_api_call" and add them up, so DB queries are counted on the remote end 
and passed back as the return value of those calls. The metric should therefore 
remain the same as it was.
   
   I introduced a context manager for that - it is nicer than the original 
decorated method. Speaking of which, I also found one potential memory leak. I 
think when you register for events, you should always remove the event function 
afterwards, because otherwise it might leak memory (see the docstring). It would 
not matter much in this case, because parsing a file is always done in a fork, 
but I made the context manager clean up and remove the registered event. In the 
"database isolation" case it does matter, because there the memory leaks could 
accumulate in the gunicorn threads handling the calls.
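   The cleanup can look roughly like this (a sketch, not the actual 
implementation; `count_queries` and the counter dict are names I made up). The 
key part is `event.remove()` in a `finally` block, so the listener never 
outlives the context:

   ```python
   from contextlib import contextmanager

   from sqlalchemy import create_engine, event, text


   @contextmanager
   def count_queries(engine):
       """Count statements executed on *engine*, always detaching the listener."""
       counter = {"count": 0}

       def _after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
           counter["count"] += 1

       event.listen(engine, "after_cursor_execute", _after_cursor_execute)
       try:
           yield counter
       finally:
           # Without event.remove() every call would leave one more listener
           # attached to the engine - the leak mentioned above.
           event.remove(engine, "after_cursor_execute", _after_cursor_execute)


   engine = create_engine("sqlite://")
   with count_queries(engine) as counter:
       with engine.connect() as conn:
           conn.execute(text("SELECT 1"))
   assert counter["count"] == 1
   ```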
   
   Let me know if the lines of thought above make sense please :)  
   


