argibbs commented on issue #33410:
URL: https://github.com/apache/airflow/issues/33410#issuecomment-1706964333
Ok,
Have managed to get to a real computer with a keyboard and everything and
have a proper look.
First up, my apologies for breaking something, that was obviously not the
intent.
So, comments:
1. The proposed change `self._add_paths_to_queue([request.full_filepath],
True)` is unfortunately problematic ... this would effectively undo the fix I
made in the first place. You can read my PR + associated for more details, but
basically SLA callbacks can fire so often that the file queue never drains, and
the airflow scheduler effectively stops processing changes to dag files.
2. So... what to do next? I'm still thinking on that one (but if you work it
out, don't wait for me). There are a couple of options ... I don't yet
understand how zipped dags are handled differently to standard dags. Assuming
they're not too different, then I would say the fix is either:
i) add the contents of the zips to the file path queue like we do normal
dags (that's in `prepare_file_path_queue` at approx like 1200). Quite what this
involves I don't know, I've never used zipped dags, and I'd have to experiment
locally as part of the fix.
ii) go further with my SLA fix. I did originally have a more involved
solution (again if you look at my PR history, you can see what it was) that
split the queues and effectively tracked which files had been processed. Then
even if the SLA callbacks were spamming the queue, it wouldn't matter, because
we'd ensure any dags which hadn't been refreshed after X seconds (by default I
think it's 30 seconds or maybe 60) would then get priority and would be
refreshed. SLA callback spam would make the system less responsive to updates,
but it wouldn't stop ignoring them altogether. This change was originally
abandoned because it was too big a change and people were understandably
nervous, but maybe we could try again with as minimal an impl as possible. If
we did that, then the proposed fix by @tseruga would be fine.
That said, while I think the impl in (2) would work, it does feel like one
of those fix-the-symptoms-not-the-cause type things, and (1) is probably
cleaner, for some value of clean.
As mentioned before, I will probably have an opportunity to look at this in
October maybe. But no promises - if you can come up with a good solution before
then, go for it. I'm happy to weigh in on PRs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]