Another related thing is cleanup of logs, which was discussed a few days ago. Airflow generates an enormous amount of logs, which I like because it makes troubleshooting very easy, but one DAG with 5 tasks that I have been running a few times a day for a few weeks generated 2 GB of logs! I can probably switch the logging mode to something less detailed, but what I really want is an automatic archiving capability. For now I can just use another Airflow DAG to do this cleanup, but it would be nice to have this feature built in.
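To make the workaround concrete, here is a rough sketch of the kind of function such a cleanup DAG could call. It is only an illustration under my own assumptions: the function name, the 14-day default, and the idea of gzipping into a separate archive directory (which must live outside the log directory) are mine, not anything Airflow ships.

```python
import gzip
import shutil
import time
from pathlib import Path


def archive_old_logs(log_dir, archive_dir, max_age_days=14, now=None):
    """Gzip log files older than max_age_days into archive_dir, then delete them.

    Assumes archive_dir is NOT inside log_dir. Returns the archive paths written.
    """
    log_dir, archive_dir = Path(log_dir), Path(archive_dir)
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    archived = []
    # Build the file list up front so nothing is removed while the tree is walked.
    for path in sorted(p for p in log_dir.rglob("*") if p.is_file()):
        if path.stat().st_mtime >= cutoff:
            continue  # still recent, leave it in place
        # Mirror the dag/task directory layout under the archive directory.
        target = archive_dir / (str(path.relative_to(log_dir)) + ".gz")
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()
        archived.append(target)
    return archived
```

Age is judged by file modification time, so a log that is still being appended to will not be touched until it has been idle for the whole retention window.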
On Wed, Apr 5, 2017 at 11:23 PM, Vijay Krishna Ramesh <[email protected]> wrote:
> To add to Siddharth's pretty extensive list (in particular, the "delete a
> DAG from the code that makes up the dag bag folder, but now it shows up
> with a ! icon and you have to manually set it to is_active = f" issue that
> I didn't see in 1.8.0-RC4 but started seeing in 1.8.0-RC5 that became
> 1.8.0) -
>
> how does XCOM data get cleaned up? would be nice to either let tasks
> consume the data (and then it goes away from the backing db, after an ack
> or something) - or at the very least, TTL after a set interval.
>
>
>
> On Wed, Apr 5, 2017 at 7:46 PM, siddharth anand <[email protected]> wrote:
>
> > Edgardo,
> > This is a great question and something that requires functionality to
> > address. As Airflow starts getting used for bigger workloads, we need a way
> > to clean up defunct resources.
> >
> > - How do we delete a dag and its related resources?
> >   - Until the recent release, the way that I stopped having a defunct
> >     (retired) dag show up in the UI was to move the DAG file out of the
> >     dag_folder or just deleting it from Git. Our dag folders are just
> >     symlinks to tagged Git repos.
> >   - This no longer works -- the UI will display the dag list based on
> >     entries in the dag table in the airflow metadata db -- but will no
> >     longer have code to back that dag table entry. I currently manually
> >     delete a row from the dag table, but that is surely not the right
> >     thing to do.
> > - How do we retire entries from the *task_instance, job, log, xcom,
> >   sla_miss, dag_stats,* and *dag_run* tables for dags that are deleted?
> >   (I can surely clean these up manually as well, but we need a UI
> >   control).
> >   - *task_instance, job, log, & dag_run* tables grow faster than
> >     the others
> > - How does one track if variables, connections, or pools are no
> >   longer referenced because all of the DAGs that use them are gone?
> >   - It would be nice here to have reference counts & links to DAGs
> >     that reference a Pool, Connection, or Variable. The reference
> >     counts can be broken down into paused & unpaused.
> >
> > It's time we added some functionality to the API/CLI/UI to address these
> > functionality gaps.
> >
> > -s
> >
> > On Tue, Apr 4, 2017 at 10:25 AM, Edgardo Vega <[email protected]>
> > wrote:
> >
> > > Max,
> > >
> > > Thanks for the reply, it is much appreciated. I am currently running ~10k
> > > task a day in our test environment.
> > >
> > > It is good to know where the archive point is and that I shouldn't have
> > > any issues for a long time.
> > >
> > > I was just thinking ahead as we get airflow into production environment.
> > > Maybe in this case maybe way too far ahead.
> > >
> > >
> > > Cheers,
> > >
> > > Edgardo
> > >
> > > On Tue, Apr 4, 2017 at 11:58 AM, Maxime Beauchemin <[email protected]>
> > > wrote:
> > >
> > > > We run ~50k tasks a day at Airbnb. How many tasks/day are you planning
> > > > on running?
> > > >
> > > > Though you can archive the `task_instance` and `job` table down the
> > > > line, but that shouldn't be a concern until you hit tens of millions of
> > > > entries. Then you can setup a daily Airflow job that archives some of
> > > > these entries. I believe we do it based on `start_date` and move rows
> > > > to some other table in the same db.
> > > >
> > > > Max
> > > >
> > > > On Mon, Apr 3, 2017 at 5:30 PM, Edgardo Vega <[email protected]>
> > > > wrote:
> > > >
> > > > > I have been playing with airflow for a few days and it's not obvious
> > > > > what will happen down the road when we have lots of dags over a long
> > > > > period of time. I set a fake dag to run once a minute for a few days
> > > > > and everything seems okay except the graph view dropdown which works
> > > > > but take a few seconds to show up.
> > > > >
> > > > > Is there a way roll older data out of the system in order to clean
> > > > > things visually and keep the database at a smallish size?
> > > > >
> > > > > --
> > > > > Cheers,
> > > > >
> > > > > Edgardo
> > > > >
> > > >
> > >
> > > --
> > > Cheers,
> > >
> > > Edgardo
> > >
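P.S. On the metadata side, the approach Max describes in the quoted thread (archive by `start_date`, moving rows to another table in the same db) might look roughly like the sketch below. It uses an in-memory SQLite database and a cut-down `task_instance` table purely for illustration; the `task_instance_archive` table name, the three columns, and the function name are my assumptions and not Airflow's real schema or API.

```python
import sqlite3


def archive_task_instances(conn, cutoff):
    """Move task_instance rows with start_date < cutoff into task_instance_archive.

    Returns the number of rows removed from task_instance.
    """
    with conn:  # single transaction: the copy and the delete commit together
        conn.execute(
            "INSERT INTO task_instance_archive "
            "SELECT * FROM task_instance WHERE start_date < ?",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM task_instance WHERE start_date < ?", (cutoff,)
        )
    return cur.rowcount


# Hypothetical, simplified tables; dates are ISO strings so they sort correctly.
conn = sqlite3.connect(":memory:")
for table in ("task_instance", "task_instance_archive"):
    conn.execute(
        f"CREATE TABLE {table} (task_id TEXT, dag_id TEXT, start_date TEXT)"
    )
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [
        ("t1", "demo", "2017-01-15T00:00:00"),
        ("t1", "demo", "2017-03-30T00:00:00"),
        ("t2", "demo", "2017-04-04T00:00:00"),
    ],
)
conn.commit()

moved = archive_task_instances(conn, "2017-03-01T00:00:00")
```

Doing the copy and the delete in one transaction means a failure part-way through leaves the live table untouched, which matters if this runs unattended as a daily job.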
