Hi all,
TLDR: what’s our approach for handling offline migrations in cases where the options are either deleting some likely non-critical data or spending a lot of time implementing, testing and maintaining the offline implementation? Do we know if users actually use offline migrations?

Some context: I worked on a complex migration [1] related to Deadline Alerts a few months ago. It was needed to ensure that users don’t lose the archive of missed deadlines, and the deadlines for incomplete Dag runs, when migrating from 3.1 to 3.2. Another migration [2] will need a similar but less complex change to migrate the on_[success/failure]_callbacks for incomplete Dag runs (those callbacks are deleted from the DB once Dag runs complete anyway).

For both of these, especially the former, I decided to implement the data migration purely in SQLA/Python, without any direct SQL statements, because (let me know if any of my assumptions are incorrect):
* Some of the logic gets very unreadable when implemented in SQL.
* Slightly different SQL statements are required for each type of DB we support.
* As far as I know, we don’t have a good way to run automated tests on these migrations, especially for the edge cases relevant to each migration, and manual testing with lots of edge cases is very time-consuming.
* The tables involved are expected to be small enough that iterating through rows in Python will not cause a significant slowdown.

I’ve described above what each migration is responsible for and made educated guesses about how users use these features and run migrations, but I don’t have data to back those guesses up. Essentially, I made a small compromise on performance to aim for correctness and robustness by using only SQLA and iterating in Python.

Later, when my PR was almost green in CI, a failing test showed that this approach does not work with offline migrations, because the Python code cannot fetch existing rows from the database to iterate through. Fully supporting offline migrations here would require re-implementing the logic in SQL. I did find some older migrations that also don’t fully support offline migrations, so I copied their approach of printing a large warning saying that offline migration is not supported for that change. I also added a SQL statement that deletes all rows from the relevant table, which is needed for CI to pass and for users to be able to run the migration in offline mode and have Airflow work without DB errors once the migration completes.

So I’m wondering what the best approach would be for this migration specifically before we ship 3.2, and also what the best practice should be going forward:
* Keep it as is: output a large warning in the offline migration script and delete the rows needed to proceed.
* Raise an error when the user tries to run the offline migration for this change, so they are forced to read and understand what is happening and what they need to do (delete rows from the given table) to proceed.
* Something interactive, where the user has to confirm that they are OK with deleting those rows.
* Re-implement the migration in multiple SQL scripts.

Long term, I believe an easier and more automated way to test edge cases in the dataset during migrations would be very helpful and would improve robustness. I also wonder whether we have any anecdotes/data showing that offline migrations are actually used.
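For concreteness, the pattern I ended up with looks roughly like the sketch below. It is simplified: the table name, columns and the "transformation" are placeholders, not the actual Deadline/Callback schema, and the real migrations do more work per row.

    from alembic import context, op
    import sqlalchemy as sa

    def upgrade():
        if context.is_offline_mode():
            # Offline mode only emits SQL, so we cannot read and transform
            # existing rows. Warn loudly and delete them so the schema change
            # can still be applied and Airflow starts cleanly afterwards.
            print(
                "WARNING: offline migration cannot preserve rows in some_table; "
                "they will be deleted."
            )
            op.execute("DELETE FROM some_table")
            return

        # Online mode: fetch the rows and rewrite them in Python/SQLA, which
        # keeps the per-dialect SQL differences out of the migration.
        conn = op.get_bind()
        rows = conn.execute(sa.text("SELECT id, payload FROM some_table")).fetchall()
        for row in rows:
            new_payload = row.payload.upper()  # placeholder for the real transformation
            conn.execute(
                sa.text("UPDATE some_table SET payload = :p WHERE id = :id"),
                {"p": new_payload, "id": row.id},
            )

Option 2 from the list above would essentially replace the DELETE in the offline branch with a raise, and option 4 would replace the whole upgrade body with per-dialect SQL scripts.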
[1] https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/migrations/versions/0092_3_2_0_replace_deadline_inline_callback_with_fkey.py
[2] https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/migrations/versions/0091_3_2_0_restructure_callback_table.py

Best,
Ramit
