potiuk commented on code in PR #40916:
URL: https://github.com/apache/airflow/pull/40916#discussion_r1688513049
##########
airflow/dag_processing/processor.py:
##########
@@ -837,13 +877,17 @@ def process_file(
:return: number of dags found, count of import errors, last number of
db queries
"""
self.log.info("Processing file %s for tasks to queue", file_path)
+ try:
+ if InternalApiConfig.get_use_internal_api():
+ dagbag = DagFileProcessor._get_dagbag(file_path)
+ else:
+ with create_session() as session:
Review Comment:
All right @jscheffl and @vincbeck. I think I have it right - I will do some
more manual testing and see if it works in various combinations. Looking
forward to your comments.
But the thing I found is that the counter was LIKELY not doing what it was
supposed to do when added in https://github.com/apache/airflow/pull/40323
@MaksYermak @kosteev @VladaZakharova -> this is probably the last "serious"
change needed to make AIP-44 work (special tests and docs will still be needed,
but those can be added after that). Let me know if I understood it properly.
When I looked at the counter from #40323, the event was added to the session
just before creating the dagbag from the file - apparently with the intention
of counting the DB queries done inside dagbag parsing. But as far as I
understand (please correct me if I am wrong), that would not work, because the
session the event was registered for is not passed down to / used in dagbag
parsing - any DB access there would create a new session rather than reuse the
session created in "process_file" - so it would not count any queries from
parsing. It was, however, counting queries potentially run during callbacks,
while updating warnings, etc.
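To illustrate the point (a minimal standalone sketch, not Airflow code - plain engines stand in for sessions here): a query-counting listener registered for one engine/session simply does not fire for statements issued through a different one, which is why queries made on fresh sessions inside dagbag parsing would go uncounted:

```python
from sqlalchemy import create_engine, event, text

# Two independent in-memory engines stand in for two separate sessions.
outer = create_engine("sqlite://")  # the "process_file" session
inner = create_engine("sqlite://")  # a fresh session opened during parsing

counted = []

def _count(conn, cursor, statement, parameters, context, executemany):
    counted.append(statement)

with outer.connect():  # warm up so dialect-init queries are not counted
    pass

# Listener registered only on the "outer" engine.
event.listen(outer, "after_cursor_execute", _count)

with inner.connect() as conn:
    conn.execute(text("SELECT 1"))  # NOT counted: different engine/session

with outer.connect() as conn:
    conn.execute(text("SELECT 1"))  # counted

event.remove(outer, "after_cursor_execute", _count)
assert len(counted) == 1
```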
I **THINK** that was not the intention of the change - the intention was to
count queries resulting from parsing the DAG (but it does not). I am not sure
if we can easily fix it (though it does not matter in DB isolation mode at
least, because no DB queries will be allowed while parsing the DAG anyway).
In this change I attempted to retain that behaviour. Because we cannot pass a
parsed dag to the Internal API component, the whole "process_file" is now
effectively split into more transactions - 4 instead of the original 2. I
think that is OK: transactional integrity is not important in those cases, and
I collect the query counts and add them up from all 4 calls that might turn
into an "internal_api_call" - DB queries will be counted on the remote end and
passed back as part of the return value of those calls. So the metric should
remain the same as before.
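The shape of that accumulation, as I understand it, looks roughly like the sketch below (all function names and counts are hypothetical placeholders, not the actual Airflow signatures): each step that may become an "internal_api_call" reports the number of queries it ran on the remote end, and the caller sums them so the emitted metric matches the old single-transaction count:

```python
# Hypothetical sketch of summing per-call query counts across the 4 steps.
def _get_dagbag(path):
    return "dagbag", 3          # (result, queries run on the remote end)

def _process_dags(dagbag):
    return 2, 4                 # (dags found, queries)

def _save_import_errors(path):
    return 1                    # queries

def _update_warnings(path):
    return 2                    # queries

def process_file(path):
    total_queries = 0
    dagbag, q = _get_dagbag(path)
    total_queries += q
    found, q = _process_dags(dagbag)
    total_queries += q
    total_queries += _save_import_errors(path)
    total_queries += _update_warnings(path)
    # One metric covering all 4 transactions, equivalent to the old count.
    return found, total_queries

print(process_file("dags/example.py"))  # (2, 10)
```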
I introduced a context manager for that - it is nicer than the original
decorated method. Speaking of which, I also found one potential memory leak. I
think that when you register for events, you should always remove the event
function afterwards, because otherwise it might leak memory (see the
docstring). It would not matter much in this case - because parsing a file is
always done in a fork - but I made the context manager clean up and remove the
registered event anyway. In the "database isolation" case it would matter,
because there the memory leaks could stay in the gunicorn threads handling the
calls.
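A minimal standalone sketch of that pattern (illustrative only, not the actual Airflow helper; the engine-level `after_cursor_execute` hook stands in for the session-level one): the `finally` block guarantees `event.remove()` runs even if the body raises, so the listener cannot outlive the call and accumulate in long-lived worker processes:

```python
from contextlib import contextmanager
from sqlalchemy import create_engine, event, text

@contextmanager
def count_queries(engine):
    """Count cursor executions on `engine`, removing the listener on exit."""
    counter = {"queries": 0}

    def _increment(conn, cursor, statement, parameters, context, executemany):
        counter["queries"] += 1

    event.listen(engine, "after_cursor_execute", _increment)
    try:
        yield counter
    finally:
        # Crucial cleanup: without event.remove() the listener (and anything
        # it closes over) stays registered for the lifetime of the engine.
        event.remove(engine, "after_cursor_execute", _increment)

engine = create_engine("sqlite://")
with engine.connect():  # warm up so dialect-init queries are not counted
    pass
with count_queries(engine) as counter:
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
assert counter["queries"] == 1
```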
Please let me know if the lines of thought above make sense :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]