Hi all,

TLDR: what’s our approach for handling offline migrations in cases where the
options are either deleting some likely non-critical data or spending a lot of
time implementing, testing, and maintaining the offline implementation? Do we
know whether users actually use offline migrations?


Some context: I worked on a complex migration [1] related to Deadline Alerts a
few months ago. It was needed to ensure that users don’t lose the archive of
missed deadlines, or the deadlines for incomplete Dag runs, when migrating
from 3.1 to 3.2. Another migration [2] will need a similar but less complex
change to migrate the on_[success/failure]_callbacks for incomplete Dag runs
(those callbacks are deleted from the DB once Dag runs complete anyway). For
both of these, especially the former, I decided to implement them purely in
SQLAlchemy/Python, without any direct SQL statements, because (and let me know
if any of my assumptions are incorrect):


  * Some of the logic gets very unreadable when implemented in SQL.
  * Slightly different SQL statements are required for each type of DB we
    support.
  * As far as I know, we don’t have a good way to run automated testing on
    these migrations, especially the edge cases relevant to each migration.
    Manual testing with lots of edge cases is very time-consuming.
  * The table for each of those migrations is expected to be small enough
    that iterating through rows in Python won’t cause a significant slowdown.
    I’ve described above what each migration is responsible for and made
    educated guesses about how users use these features and run migrations,
    but I don’t have data to support this. (A rough sketch of the
    pure-SQLAlchemy approach follows this list.)
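To make the trade-off concrete, the online path looks roughly like the sketch
below. This is illustrative only: the table/column names and the per-row
transformation are hypothetical stand-ins, not the actual code from [1]:

    import sqlalchemy as sa
    from alembic import op

    def upgrade():
        conn = op.get_bind()
        # Hypothetical lightweight table matching the pre-migration schema.
        deadline = sa.table(
            "deadline",
            sa.column("id", sa.Integer),
            sa.column("callback", sa.String),
        )
        # Fetch everything first, then iterate in Python, so the per-row
        # logic stays readable and portable across all supported backends.
        rows = conn.execute(
            sa.select(deadline.c.id, deadline.c.callback)
        ).fetchall()
        for row in rows:
            new_value = (row.callback or "").strip()  # stand-in for the real logic
            conn.execute(
                sa.update(deadline)
                .where(deadline.c.id == row.id)
                .values(callback=new_value)
            )

The catch: conn.execute(sa.select(...)) needs a live connection to fetch
rows, which is exactly what offline mode doesn’t provide.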


Basically, I made a small compromise on performance, aiming for correctness
and robustness by using only SQLAlchemy and iterating in Python. Later, when
my PR was almost green in CI, I found out through a failing test that this
approach would not work with offline migration, because the Python script
cannot fetch existing rows from the database to iterate through them. Fully
supporting offline migration here would require a re-implementation in SQL.

However, I found some old migrations that also don’t fully support offline
migrations and copied their approach: output a large warning saying that
offline migration is not supported for that change. I also added an SQL
statement to delete all the rows from the relevant table, which is needed for
CI to pass, and for users to be able to run the migration in offline mode and
have Airflow work without DB errors once the migration is complete. So, I’m
wondering what the best approach would be for this migration specifically
before we ship 3.2, and also what the best practice should be moving forward:


  * Keep it as is: output a large warning in the offline migration script and
    delete the rows needed to proceed.
  * Raise an error if the user tries to run the offline migration for this
    change, so that they are forced to read and understand what is happening
    and what they need to do (delete rows from the given table) to proceed.
    (Both of these first two options are sketched after this list.)
  * Something interactive, where the user has to confirm that they’re OK with
    deleting those rows.
  * Re-implement the migration in multiple SQL scripts.
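For reference, the first two options both amount to a guard at the top of
upgrade(). A rough sketch (hypothetical table name again;
context.is_offline_mode() is the standard Alembic check):

    from alembic import context, op

    def upgrade():
        if context.is_offline_mode():
            # Option 1: warn loudly, then emit a destructive DELETE so the
            # generated SQL script still leaves a consistent state.
            print(
                "WARNING: offline migration is not fully supported for this "
                "revision; all rows in the deadline table will be deleted."
            )
            op.execute("DELETE FROM deadline")
            return
        # Option 2 would instead raise inside the guard above:
        #   raise RuntimeError(
        #       "Offline migration is not supported for this revision; "
        #       "delete the rows from the deadline table, then re-run."
        #   )
        # ... online path: fetch and migrate rows via op.get_bind() ...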


Long term, I believe an easier, more automated way to test edge cases in the
dataset during migration would be very helpful and would improve robustness.
I also wonder whether we have any anecdotes/data on offline migrations
actually being used.
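For instance (purely a sketch: the revision IDs, table name, and alembic.ini
location are all made up), a pytest harness that seeds edge-case rows and
then runs the migration under test could look like:

    import sqlalchemy as sa
    from alembic import command
    from alembic.config import Config

    def test_migration_keeps_incomplete_run_rows(tmp_path):
        url = f"sqlite:///{tmp_path}/test.db"
        cfg = Config("alembic.ini")  # assumed to point at the migration scripts
        cfg.set_main_option("sqlalchemy.url", url)

        command.upgrade(cfg, "rev_before")  # hypothetical pre-migration revision
        engine = sa.create_engine(url)
        with engine.begin() as conn:
            # Seed an edge case: a row belonging to an incomplete Dag run.
            conn.execute(sa.text(
                "INSERT INTO deadline (id, callback) VALUES (1, 'cb')"
            ))

        command.upgrade(cfg, "rev_after")  # hypothetical revision under test
        with engine.connect() as conn:
            rows = conn.execute(sa.text("SELECT * FROM deadline")).fetchall()
        assert len(rows) == 1  # the edge-case row survived the migration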


[1] https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/migrations/versions/0092_3_2_0_replace_deadline_inline_callback_with_fkey.py
[2] https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/migrations/versions/0091_3_2_0_restructure_callback_table.py

Best,
Ramit
