ms32035 commented on pull request #13199:
URL: https://github.com/apache/airflow/pull/13199#issuecomment-811802961


   @ashb this is not helpful at all.
   
   - You are not providing any (measurable) arguments as to why implementing an 
additional structure storing the dependencies, which would add to the scheduler 
overhead regardless if it is used or not, is a better solution than 
recalculating a clean state  in memory, with some time based caching, when the 
functionality is actually used. Maintaining an additional database table also 
adds a number of edge cases that would require handling and are not an issue 
for the current approach, such as dependencies pointing to invalid dags (these 
are texts) or tags or tasks being removed. Furthermore, there is no reason the 
scheduler needs to know/use these dependencies at all. Also a tricky question 
here. The same supposedly heavy `collect_dags_from_db` function is called in 
the web server by `/refresh_all` endpoint.
   - requirement 2) defies the purpose of the view, and I’ve written my 
arguments before
   
   To make it clear, the main structure behind the view is a dictionary which 
will be no more than kilobytes for deployments with thousands of dags, so 
storing that in memory is not an issue at all.
   
   One other idea I have is to add a check based on 
`max(serialized_dag.last_updated)` to check if recalculating is needed.
   
   I’d really appreciate more support to progress this, otherwise I’ll close 
the PR, leave it us a plugin, and potentially abandon in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to