Another related issue is log cleanup, which was discussed a few days ago.
Airflow generates an enormous volume of logs, which I like because it makes
troubleshooting very easy, but one DAG with 5 tasks that I have been running
a few times a day for a few weeks generated 2 GB of logs! I can probably
switch the logging mode to something less detailed, but what I really want
is an automatic archiving capability. For now I can just use another Airflow
DAG to do this cleanup, but it would be nice to have this as a built-in
feature.
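
To make the "another DAG does the cleanup" idea concrete, here is a minimal
sketch of the function such a DAG could call. It assumes the task logs are
plain files under a single directory; the name purge_old_logs and its
parameters are made up for illustration, not anything Airflow ships.

```python
import os
import time


def purge_old_logs(log_dir, max_age_days, now=None):
    """Delete files under log_dir last modified more than max_age_days ago.

    Hypothetical helper: the name and arguments are illustrative only.
    Returns the list of removed paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            # Compare the file's modification time against the cutoff.
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

Something like this could be run from a PythonOperator in a daily DAG.
Compressing or moving the files instead of deleting them would turn it into
archiving rather than cleanup.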

On Wed, Apr 5, 2017 at 11:23 PM, Vijay Krishna Ramesh <
[email protected]> wrote:

> To add to Siddharth's pretty extensive list (in particular, the "delete a
> DAG from the code that makes up the dag bag folder, but now it shows up
> with a ! icon and you have to manually set it to is_active = f" issue that
> I didn't see in 1.8.0-RC4 but started seeing in 1.8.0-RC5 that became
> 1.8.0) -
>
> how does XCOM data get cleaned up? would be nice to either let tasks
> consume the data (and then it goes away from the backing db, after an ack
> or something) - or at the very least, TTL after a set interval.
>
>
>
> On Wed, Apr 5, 2017 at 7:46 PM, siddharth anand <[email protected]> wrote:
>
> > Edgardo,
> > This is a great question and something that requires functionality to
> > address. As Airflow starts getting used for bigger workloads, we need a
> > way to clean up defunct resources.
> >
> >    - How do we delete a dag and its related resources?
> >       - Until the recent release, the way that I stopped having a defunct
> >       (retired) dag show up in the UI was to move the DAG file out of the
> >       dag_folder or just deleting it from Git. Our dag folders are just
> >       symlinks to tagged Git repos.
> >       - This no longer works -- the UI will display the dag list based on
> >       entries in the dag table in the airflow metadata db -- but will no
> >       longer have code to back that dag table entry. I currently manually
> >       delete a row from the dag table, but that is surely not the right
> >       thing to do.
> >       - How do we retire entries from the *task_instance, job, log, xcom,
> >       sla_miss, dag_stats,* and *dag_run* tables for dags that are
> >       deleted? (I can surely clean these up manually as well, but we need
> >       a UI control).
> >          - *task_instance, job, log,* & *dag_run* tables grow faster than
> >          the others
> >          - How does one track if variables, connections, or pools are no
> >          longer referenced because all of the DAGs that use them are gone?
> >          - It would be nice here to have reference counts & links to DAGs
> >          that reference a Pool, Connection, or Variable. The reference
> >          counts can be broken down into paused & unpaused.
> >
> > It's time we added some functionality to the API/CLI/UI to address these
> > functionality gaps.
> >
> > -s
> >
> > On Tue, Apr 4, 2017 at 10:25 AM, Edgardo Vega <[email protected]>
> > wrote:
> >
> > > Max,
> > >
> > > Thanks for the reply, it is much appreciated.  I am currently running
> > > ~10k tasks a day in our test environment.
> > >
> > > It is good to know where the archive point is and that I shouldn't have
> > > any issues for a long time.
> > >
> > > I was just thinking ahead as we get airflow into a production
> > > environment. Maybe in this case way too far ahead.
> > >
> > >
> > > Cheers,
> > >
> > > Edgardo
> > >
> > > On Tue, Apr 4, 2017 at 11:58 AM, Maxime Beauchemin <
> > > [email protected]> wrote:
> > >
> > > > We run ~50k tasks a day at Airbnb. How many tasks/day are you
> > > > planning on running?
> > > >
> > > > You can archive the `task_instance` and `job` tables down the line,
> > > > but that shouldn't be a concern until you hit tens of millions of
> > > > entries. Then you can set up a daily Airflow job that archives some
> > > > of these entries. I believe we do it based on `start_date` and move
> > > > rows to some other table in the same db.
> > > >
> > > > Max
> > > >
> > > > On Mon, Apr 3, 2017 at 5:30 PM, Edgardo Vega <[email protected]>
> > > > wrote:
> > > >
> > > > > I have been playing with airflow for a few days and it's not
> > > > > obvious what will happen down the road when we have lots of dags
> > > > > over a long period of time. I set a fake dag to run once a minute
> > > > > for a few days and everything seems okay except the graph view
> > > > > dropdown, which works but takes a few seconds to show up.
> > > > >
> > > > > Is there a way to roll older data out of the system in order to
> > > > > clean things up visually and keep the database at a smallish size?
> > > > >
> > > > > --
> > > > > Cheers,
> > > > >
> > > > > Edgardo
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Cheers,
> > >
> > > Edgardo
> > >
> >
>
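
For reference, the archiving approach Max describes above (move rows out of
task_instance based on start_date into another table in the same db) could
look roughly like the sketch below. It is not Airbnb's actual job. It
assumes a SQLite connection and timestamps stored as ISO-8601 strings so it
can run standalone, and the task_instance_archive table name is made up. A
real metadata db would need the equivalent SQL for its own backend.

```python
import sqlite3


def archive_task_instances(conn, cutoff):
    """Move task_instance rows with start_date < cutoff to an archive table.

    Hypothetical helper: cutoff is an ISO-8601 string, and the archive
    table name is an assumption. Returns the number of rows moved.
    """
    with conn:  # commit on success, roll back if any statement fails
        cur = conn.cursor()
        # Create an empty archive table with the same columns on first use.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS task_instance_archive AS "
            "SELECT * FROM task_instance WHERE 0"
        )
        # Copy the old rows, then delete them from the live table.
        cur.execute(
            "INSERT INTO task_instance_archive "
            "SELECT * FROM task_instance WHERE start_date < ?",
            (cutoff,),
        )
        cur.execute(
            "DELETE FROM task_instance WHERE start_date < ?", (cutoff,)
        )
        return cur.rowcount
```

The same pattern would presumably apply to the job table, and a daily
Airflow task could call it with a cutoff of, say, 90 days ago.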
