argibbs commented on issue #33410:
URL: https://github.com/apache/airflow/issues/33410#issuecomment-1706964333

   Ok,
   
   Have managed to get to a real computer with a keyboard and everything and 
have a proper look.
   
   First up, my apologies for breaking something, that was obviously not the 
intent.
   
   So, comments:
   1. The proposed change `self._add_paths_to_queue([request.full_filepath], 
True)` is unfortunately problematic ... this would effectively undo the fix I 
made in the first place. You can read my PR + associated for more details, but 
basically SLA callbacks can fire so often that the file queue never drains, and 
the airflow scheduler effectively stops processing changes to dag files.
   2. So... what to do next? I'm still thinking on that one (but if you work it 
out, don't wait for me). There are a couple of options ... I don't yet 
understand how zipped dags are handled differently to standard dags. Assuming 
they're not too different, then I would say the fix is either:
      i) add the contents of the zips to the file path queue like we do normal 
dags (that's in `prepare_file_path_queue` at approx like 1200). Quite what this 
involves I don't know, I've never used zipped dags, and I'd have to experiment 
locally as part of the fix.
      ii) go further with my SLA fix. I did originally have a more involved 
solution (again if you look at my PR history, you can see what it was) that 
split the queues and effectively tracked which files had been processed. Then 
even if the SLA callbacks were spamming the queue, it wouldn't matter, because 
we'd ensure any dags which hadn't been refreshed after X seconds (by default I 
think it's 30 seconds or maybe 60) would then get priority and would be 
refreshed. SLA callback spam would make the system less responsive to updates, 
but it wouldn't stop ignoring them altogether. This change was originally 
abandoned because it was too big a change and people were understandably 
nervous, but maybe we could try again with as minimal an impl as possible. If 
we did that, then the proposed fix by @tseruga would be fine.
   
   That said, while I think the impl in (2) would work, it does feel like one 
of those fix-the-symptoms-not-the-cause type things, and (1) is probably 
cleaner, for some value of clean.
   
   As mentioned before, I will probably have an opportunity to look at this in 
October maybe. But no promises - if you can come up with a good solution before 
then, go for it. I'm happy to weigh in on PRs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to