Re: [DISCUSS] Managing airflow database size and retention

Jarek Potiuk Mon, 29 Dec 2025 11:24:15 -0800

I have a feeling that making those **two** tables partition-friendly should
be easy - and maybe that's all that we need. That would make it possible
for you to use the partitioning, (where it would not be necessary for
everyone).
There might be a few ways of doing it without losing uniqueness. For
example Job ID and audit log ID could  follow certain conventions - and
always start with a partition key) thus providing uniqueness we need.


Those two tables are pretty specific and neither job id nor log id impact
anything else in our data model.

I think such a change to make those two tables "partition friendly" could
be accepted, but we could only say it after seeing a POC.

J.


On Mon, Dec 29, 2025 at 8:01 PM <[email protected]> wrote:

> I do think that the second option is best, it is also what we wanted to
> do, the only reason we did not do that is because from our tests, sql
> alchemy sometimes breaks as it expects certain constraints which are not
> there, mainly for update queries, select works well, if I am not mistaken,
> there are 3 or 4 large tables, job being the largest (7 times larger than
> the second largest), the question is, will such a pull request be approved?
>
> As we do lose the unique constraint (as it becomes per partition), though
> it is a sequence that most likely won't repeat until the previous has been
> deleted, but if not, we might query unrelated job or log data, and so
> changes in the api server are also needed, creating the pr is not a
> problem, the question is how beneficial will it be, as if it is done to
> those problematic tables, it means that the preferred way to manage
> retention is from the db, and can be an optional alembic script.
>
> I do not want to just push a fix that will add more complexity than the
> benefit it will bring.
>
> If we do go the pr way, a separate discussion is probably needed to decide
> how it should be done (most likely an additional airflow command to turn on
> or off the partitions and retention)
>
> Doing it as a custom solution has caused problems with sqlalchemy, and we
> do not want to do so as if later an alembic script relies on the primary
> key in some way, we will need to fix it manually and deal with the problems
> it may cause when we update.
>
> > On 29 Dec 2025, at 21:36, Jarek Potiuk <[email protected]> wrote:
> >
> > Another option is that you ( could make ONLY for those two tables -
> > partition-friendly.  And do not partition anything else.  I think that
> > **could** be possible - both have datetime fields that could be used as
> > partition keys - you would have to assess if you can do it as your
> "custom"
> > solution or whether it would require some changes to airflow models. But
> I
> > can't see foreign key problems if ONLY those two tables are partitioned,
> so
> > likely you could do it yourself in your DB.
> >
> > In this case - maybe "let's solve those tables that are problematic" is
> > easier to do than "let's apply partitioning to everything".
> >
> >> On Mon, Dec 29, 2025 at 7:31 PM Jarek Potiuk <[email protected]> wrote:
> >>
> >> Purely theoretically, you could change the log and job tables to be
> >> unlogged - and thus avoid WAL for them.
> >>
> >> The drawback of this:
> >>
> >> * the data in those tables will be lost if you pull the plug or kill -9
> >> the primary postgres server
> >> * the tables are not available (at all) in replicas - so in case of a
> >> fallback, you would have to have a manual data import/export for those
> >> tables on fail-over, rather than rely on replicas being "ready to
> fallback
> >> immediately". Or accept data loss.
> >> * the data from those tables will not be present in backups
> >>
> >> I am not 100% sure, but I believe loss of data in both tables is not
> >> really catastrophic for Airflow, so maybe it's acceptable risk (but
> likely
> >> you should do a disaster-recovery test to see what happens and how to
> >> recover in case, indeed, someone pulls the plug on your postgres server.
> >>
> >> J,
> >>
> >>
> >>
> >>
> >>> On Mon, Dec 29, 2025 at 7:16 PM Natanel <[email protected]>
> wrote:
> >>>
> >>> Yes, the problem is the manual deletions, we have tried it, it
> resulted in
> >>> the same exact issue, as the scheduled db procedures which clean up the
> >>> rows marked as deleted actually get deleted, and so it takes up
> storage,
> >>> yet it does not solve the WAL problem, the problematic table is
> actually
> >>> not task_instance, it is relatively small, the log and job tables are
> the
> >>> biggest tables (the problem with them is the primary key change
> required),
> >>> by a multiple of 10 (or more, cluster dependant).
> >>>
> >>> The smaller batches might solve the issue, however, it seems to just
> delay
> >>> the problem a little rather than solve it, as deleting data with a
> delete
> >>> query (especially a lot of data) is not a very light operation, and so
> I
> >>> think that this is the main issue.
> >>>
> >>> It would be nice if we could use partitions instead, as it is a lighter
> >>> operation, and does not require us to maintain a query and manage our
> db,
> >>> I
> >>> have thought about changing the models, most of the changes are
> relatively
> >>> simple, for some it is just removing the foreign key and relying on ORM
> >>> level constraints, for others, it requires adding a pre query to have
> the
> >>> same constrains but I do not like that idea, maybe there is another
> way to
> >>> make airflow "partition-friendly"?
> >>>
> >>> I can't think of a nice way to do so, maybe it does not exist, as the
> db
> >>> clean is as simple as a delete query gets, yet when there is a lot of
> >>> data,
> >>> it is all duplicated in WALs.
> >>>
> >>> On Mon, Dec 29, 2025, 19:40 Daniel Standish via dev <
> >>> [email protected]>
> >>> wrote:
> >>>
> >>>> Have you looked at doing manual deletions?  I.e. writing your own sql?
> >>>>
> >>>> The db clean command is probably not "optimal" for all scenarios.
> >>>>
> >>>> So for example, if the main problem table for you is task_instance,
> you
> >>>> could periodically delete TI records in smaller batches using some
> >>>> appropriate index (whether it exists now or you add it).  Then maybe
> you
> >>>> would not stress the db as hard.
> >>>>
> >>>> Airflow isn't designed to use partitions so, you may not get good
> >>> results
> >>>> with that approach.
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Dec 29, 2025 at 7:32 AM Natanel <[email protected]>
> >>> wrote:
> >>>>
> >>>>> Hello everyone, after having issues with the 'airflow db clean'
> >>> command,
> >>>>> where due to the amount of dags and tasks that are running every day
> >>> in
> >>>> our
> >>>>> deployments, we get a lot of new data every day, which is stored in
> >>> the
> >>>>> database, and when we delete the data, due to the way PGSQL works,
> the
> >>>>> WAL's get replicated to both the archive storage and the main data
> >>>> storage
> >>>>> of the db instance, which in turn, causes a significant jump in cpu
> >>>> usage,
> >>>>> ram usage and disk usage, whenever we run the command, which causes
> >>> all
> >>>>> kinds of issues, we even had it once fill up the db storage, and
> >>> causing
> >>>>> the database to be unresponsive, forcing us to move to our backup
> >>>> database,
> >>>>> after we haven't ran the command for a few months due to human error.
> >>>>>
> >>>>> As of now, I know that this is the accepted and widely used way of
> >>>> managing
> >>>>> the airflow database's size, however, we noticed that it may cause
> >>> issues
> >>>>> in certain cases, just like in our case, where if the db has not been
> >>>>> cleaned up for a while, cleaning it can be problematic.
> >>>>>
> >>>>> We decided to try and partition the table, and use pgsql's built in
> >>>>> retention of partitions, which does not issue a DELETE query, and is
> >>>>> lighter and faster, while being simpler to use, however, we have
> >>>>> encountered issues due to having Foreign Key constraints in some
> >>> tables,
> >>>>> having to duplicate such keys and other than forcing code changes (as
> >>> the
> >>>>> foreign key must include the partitioned key, as the partitioned key
> >>> must
> >>>>> be part of the primary key), while also having the issue of
> sqlalchemy
> >>>>> breaking once we change the primary key, with the addition of the
> >>>>> constraints on the primary key breaking.
> >>>>>
> >>>>> And in Mysql, due to the foreign keys, it is not possible to
> partition
> >>>>> tables which include them, as it is not supported yet (according to
> >>> this
> >>>>> <
> >>>>>
> >>>>
> >>>
> https://dev.mysql.com/doc/refman/8.4/en/partitioning-limitations-storage-engines.html
> >>>>>>
> >>>>> ).
> >>>>>
> >>>>> Has anyone else tried to use the databases built in partition
> >>> retention
> >>>>> system instead of the 'airflow db clean' command?
> >>>>>
> >>>>> Thanks, Natanel.
> >>>>>
> >>>>
> >>>
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: [DISCUSS] Managing airflow database size and retention

Reply via email to