Hi Gerard, I like your examples, but compared to the first sections of your document, that last section felt a bit rushed. By looking at the actual example and your comments above (the issues you discovered) I was able to follow it, but for people new to Airflow it might be confusing. Why not document these pitfalls/issues right in the doc? All of them are very valid points and would save time for other people.

The big thing for me is exactly this: a better strategy to process a large backfill when the desired schedule is 1 day. Processing 700+ days is going to take a lot of time and overhead when processing per month is an option. Is a duplicate of the DAG with a different interval a better choice, or are there strategies to detect this in an operator and use its output to specify the date window boundaries? Personally, the more I think about it, the more I am inclined to use Sid's new feature (only run latest) and store ETL timestamps outside of Airflow - so very much a traditional incremental ETL process. A rough sketch of what I mean follows.
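Purely illustrative - the etl_watermark table, the 'dwh' connection id and the initial date are all made up, and you would gate the load behind the new LatestOnlyOperator so only the most recent scheduled run does any work:

from datetime import datetime

from airflow.hooks.postgres_hook import PostgresHook


def get_load_window(target_table):
    """Return the (start, end) window to load, based on a high-water mark
    stored in the warehouse itself instead of Airflow's execution dates."""
    hook = PostgresHook(postgres_conn_id='dwh')
    row = hook.get_first(
        "SELECT last_loaded_at FROM etl_watermark WHERE table_name = %s",
        parameters=(target_table,))
    start = row[0] if row else datetime(2015, 1, 1)  # first-ever load boundary
    return start, datetime.utcnow()


def set_watermark(target_table, loaded_until):
    """Advance the high-water mark after a successful load."""
    hook = PostgresHook(postgres_conn_id='dwh')
    hook.run(
        "UPDATE etl_watermark SET last_loaded_at = %s WHERE table_name = %s",
        parameters=(loaded_until, target_table))

A single run then catches up however many days are pending, so a 700+ day backfill collapses into one (or a few) windowed loads instead of 700 task instances.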
I am also reading that many people stay away from backfills.

Something else for you to consider for your document is to describe typical support scenarios once you have your pipeline running:

1) retries of tasks / DAGs (I don't think it is possible to retry an entire DAG right now)
2) reloading already loaded data (the exact steps to do that - see the example right after this list)
3) typical troubleshooting steps
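For 2), the usual route is to clear the task instances for a date range and let the scheduler pick them up again. Flags as I understand them in the current CLI - double-check with airflow clear --help:

airflow clear my_etl_dag -s 2016-09-01 -e 2016-09-30

Add -t with a task regex to limit the blast radius. Of course this is only safe if the tasks are idempotent for their date window, which ties into Max's functional ETL point further down.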
On Sat, Oct 22, 2016 at 7:07 PM, Gerard Toonstra <[email protected]> wrote:

> Hi all,
>
> So I worked out a full pipeline for a toy data warehouse on postgres:
>
> https://gtoonstra.github.io/etl-with-airflow/fullexample.html
> https://github.com/gtoonstra/etl-with-airflow/tree/master/examples/full-example
>
> It demonstrates pretty much all listed principles for ETL work except for
> alerting and monitoring. Just some work TBD on the DDL and a full code
> review on naming conventions.
>
> Things I ran into:
> - Issue 137: max_active_runs doesn't work after clearing tasks; it does in
>   the very first run.
> - parameters for the standard PostgresOperator are not templated, so I
>   couldn't use the core operator.
> - it's a good idea to specify "depends_on_past" when using sensors,
>   otherwise sensors could exhaust the available processing slots.
> - a better strategy to process a large backfill when the desired schedule
>   is 1 day. Processing 700+ days is going to take a lot of time and
>   overhead when processing per month is an option. Is a duplicate of the
>   DAG with a different interval a better choice, or are there strategies
>   to detect this in an operator and use its output to specify the date
>   window boundaries?
> - when pooling is active, scheduling takes a lot more time. Even when the
>   pool is 10 and the number of instances is 7, it takes longer for the
>   instances to actually run.
>
> Looking forward to your comments on how some approaches could be improved.
>
> Rgds,
>
> Gerard
>
> On Wed, Oct 19, 2016 at 8:17 AM, Gerard Toonstra <[email protected]> wrote:
>
> > Thanks Max,
> >
> > I think it always helps when new people start using software to see
> > what their issues are.
> >
> > Some of it was also taken from the video on best practices in Nov 2015
> > on this page:
> >
> > https://www.youtube.com/watch?v=dgaoqOZlvEA&feature=youtu.be
> >
> > ----
> >
> > I made some more progress yesterday, but ran into issue 137. I think I
> > solved it with depends_on_past, but I'm going to rely on the
> > LatestOnlyOperator instead (it's better) and then work out something
> > better from there.
> >
> > Rgds,
> >
> > Gerard
> >
> > On Tue, Oct 18, 2016 at 6:02 PM, Maxime Beauchemin
> > <[email protected]> wrote:
> >
> >> This is an amazing thread to follow! I'm really interested to watch
> >> best practices documentation emerge out of the community.
> >>
> >> Gerard, I enjoyed reading your docs and would love to see this grow.
> >> I've been meaning to write a series of blog posts on the subject for
> >> quite some time. It seems like you have a really good start. We could
> >> integrate this as a "Best Practice" section into our current
> >> documentation once we build consensus about the content.
> >>
> >> Laura, please post on this mailing list once the talk is up as a
> >> video, I'd love to watch it.
> >>
> >> A related best practice I'd like to write about is the idea of applying
> >> some concepts of functional programming to ETL. The idea is to use
> >> immutable datasets/datablocks systematically as sources for your
> >> computations, in such a way that any task instance sources from
> >> immutable datasets that are persisted in your backend. That lets you
> >> satisfy the guarantee that re-running any chunk of ETL at a different
> >> point in time leads to the exact same result. It also usually means
> >> that you need to 1) do incremental loads, and 2) "snapshot" your
> >> dimension/referential/small tables in time, to make sure that running
> >> the ETL from 26 days ago sources from the dimension snapshot as it was
> >> back then and yields the exact same result.
> >>
> >> Anyhow, it's a complex and important subject I should probably write
> >> about in a structured way sometime.
> >>
> >> Max
> >>
> >> On Mon, Oct 17, 2016 at 6:12 PM, Boris Tyukin <[email protected]>
> >> wrote:
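PS - two more thoughts on the quoted thread. On Gerard's point about parameters not being templated on the core PostgresOperator: a thin subclass that extends template_fields should do it, since Airflow renders dicts in templated fields. A sketch (module paths and table names are mine, and may differ between versions):

from datetime import datetime

from airflow.models import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG('postgres_templating_example', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')


class TemplatedPostgresOperator(PostgresOperator):
    # The stock operator only templates 'sql'; adding 'parameters' makes
    # Jinja render values such as {{ ds }} inside the parameters dict.
    template_fields = ('sql', 'parameters')


load = TemplatedPostgresOperator(
    task_id='load_orders',
    postgres_conn_id='dwh',  # hypothetical connection id
    sql="INSERT INTO fact_orders "
        "SELECT * FROM staging_orders WHERE order_date = %(day)s",
    parameters={'day': '{{ ds }}'},
    dag=dag)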

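And to make Max's functional ETL idea concrete: every run writes exactly one partition keyed on the execution date, and the dimension is snapshotted once per day, so a rerun of an old date reads the dimension as it was back then. All table and column names below are invented, and the CREATE TABLE IF NOT EXISTS ... AS form needs Postgres 9.5+:

from datetime import datetime

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('functional_etl_example', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')


def load_day(ds, **kwargs):
    """Idempotent load for one execution date (ds, e.g. '2016-10-22')."""
    hook = PostgresHook(postgres_conn_id='dwh')
    snap = 'dim_customer_snap_%s' % ds.replace('-', '')
    # 1) Snapshot the small dimension table once per date; a rerun keeps
    #    the existing snapshot, preserving the point-in-time view.
    hook.run("CREATE TABLE IF NOT EXISTS %s AS SELECT * FROM dim_customer"
             % snap)
    # 2) Delete-then-insert the fact partition for this date, sourcing
    #    the dimension snapshot, so reruns yield the exact same result.
    hook.run("DELETE FROM fact_orders WHERE order_date = %s",
             parameters=(ds,))
    hook.run("INSERT INTO fact_orders "
             "SELECT o.order_date, o.amount, c.customer_key "
             "FROM staging_orders o "
             "JOIN %s c ON c.customer_id = o.customer_id "
             "WHERE o.order_date = %%s" % snap,
             parameters=(ds,))


load = PythonOperator(task_id='load_day', python_callable=load_day,
                      provide_context=True, dag=dag)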