Edgardo,
This is a great question, and one that points to a real gap in Airflow
today. As Airflow starts getting used for bigger workloads, we need a way
to clean up defunct resources.

   - How do we delete a dag and its related resources?
      - Until the recent release, the way I stopped a defunct (retired)
      DAG from showing up in the UI was to move the DAG file out of the
      dag_folder or just delete it from Git. Our dag folders are just
      symlinks to tagged Git repos.
      - This no longer works -- the UI now displays the DAG list based on
      entries in the dag table of the Airflow metadata DB, even when there
      is no longer any code backing that entry. Today I manually delete the
      row from the dag table, but that is surely not the right thing to do
      (see the sketch just after this list).
      - How do we retire entries from the *task_instance*, *job*, *log*,
      *xcom*, *sla_miss*, *dag_stats*, and *dag_run* tables for DAGs that
      have been deleted? (I can clean these up manually as well, but we
      need a UI control.)
         - The *task_instance*, *job*, *log*, and *dag_run* tables grow
         faster than the others.
      - How does one track whether Variables, Connections, or Pools are no
      longer referenced because all of the DAGs that use them are gone?
         - It would be nice here to have reference counts & links to the
         DAGs that reference a Pool, Connection, or Variable, with the
         counts broken down into paused & unpaused DAGs (a rough sketch of
         the Pool case is below).
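
For reference, this is roughly what my manual cleanup looks like today. It
is only a sketch, not something I would call safe: the connection URI and
dag_id are placeholders, I am assuming every table listed keys its rows by
dag_id (verify against your schema), dag_stats may not exist on older
versions, and you should test it against a backup. This is exactly the kind
of thing that should live behind a supported CLI command instead.

    # Rough sketch: purge all metadata rows for a retired DAG.
    # Assumes direct SQLAlchemy access to the Airflow metadata DB and that
    # every table listed below has a dag_id column.
    from sqlalchemy import create_engine, text

    # Placeholder URI -- point this at your metadata DB.
    METADATA_DB_URI = "postgresql://airflow:airflow@localhost/airflow"

    TABLES = [
        "xcom", "sla_miss", "dag_stats", "task_instance",
        "job", "log", "dag_run", "dag",
    ]

    def purge_dag(dag_id):
        engine = create_engine(METADATA_DB_URI)
        with engine.begin() as conn:  # one transaction; rolls back on error
            for table in TABLES:
                # Table names come from the fixed list above, not user input.
                stmt = text(
                    "DELETE FROM {} WHERE dag_id = :dag_id".format(table))
                result = conn.execute(stmt, {"dag_id": dag_id})
                print("{}: deleted {} rows".format(table, result.rowcount))

    if __name__ == "__main__":
        purge_dag("my_retired_dag")  # placeholder dag_id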

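And here is a rough idea of what reference counting could look like for
Pools, just by walking the DagBag. Connections and Variables are harder,
since conn ids and variable keys are buried in operator arguments and
templates; the paused/unpaused split could come from joining the result
against the dag table's is_paused column.

    # Rough sketch: count which DAGs reference each Pool by walking the
    # DagBag. Only catches pools set explicitly on tasks.
    from collections import defaultdict

    from airflow.models import DagBag

    def pool_references(dag_folder=None):
        dagbag = DagBag(dag_folder)  # None -> the configured dags_folder
        refs = defaultdict(set)
        for dag_id, dag in dagbag.dags.items():
            for task in dag.tasks:
                pool = getattr(task, "pool", None)
                if pool:
                    refs[pool].add(dag_id)
        return refs

    if __name__ == "__main__":
        for pool, dag_ids in sorted(pool_references().items()):
            print("{}: {} DAGs ({})".format(
                pool, len(dag_ids), ", ".join(sorted(dag_ids))))
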
It's time we added functionality to the API/CLI/UI to close these gaps.

-s

On Tue, Apr 4, 2017 at 10:25 AM, Edgardo Vega <[email protected]>
wrote:

> Max,
>
> Thanks for the reply, it is much appreciated. I am currently running ~10k
> tasks a day in our test environment.
>
> It is good to know where the archive point is and that I shouldn't have any
> issues for a long time.
>
> I was just thinking ahead as we get Airflow into a production environment.
> Maybe in this case way too far ahead.
>
>
> Cheers,
>
> Edgardo
>
> On Tue, Apr 4, 2017 at 11:58 AM, Maxime Beauchemin <
> [email protected]> wrote:
>
> > We run ~50k tasks a day at Airbnb. How many tasks/day are you planning on
> > running?
> >
> > You can archive the `task_instance` and `job` tables down the line, but
> > that shouldn't be a concern until you hit tens of millions of entries.
> > Then you can set up a daily Airflow job that archives some of these
> > entries. I believe we do it based on `start_date` and move rows to some
> > other table in the same db.
> >
> > Max
> >
> > On Mon, Apr 3, 2017 at 5:30 PM, Edgardo Vega <[email protected]>
> > wrote:
> >
> > > I have been playing with Airflow for a few days and it's not obvious
> > > what will happen down the road when we have lots of DAGs over a long
> > > period of time. I set a fake DAG to run once a minute for a few days
> > > and everything seems okay, except the graph view dropdown, which works
> > > but takes a few seconds to show up.
> > >
> > > Is there a way to roll older data out of the system in order to clean
> > > things up visually and keep the database at a smallish size?
> > >
> > > --
> > > Cheers,
> > >
> > > Edgardo
> > >
> >
>
>
>
> --
> Cheers,
>
> Edgardo
>