Hi Boris,

Thanks very much! These are all valid points and I'll work them out in the next couple of days. Indeed, the documentation of what the code actually does is rather limited; I should probably have described the overall strategy, so the code is easier to follow.
Thanks for the ideas for another page (support). That's also a very cool thing to explain and work on. If anyone has additional points of maintenance that happen regularly, please add to the discussion.

Rgds,

Gerard

On Tue, Oct 25, 2016 at 6:45 PM, Boris Tyukin <[email protected]> wrote:

> Hi Gerard,
>
> I like your examples, but compared to the first sections of your document, that last section felt a bit rushed. By looking at the actual example and your comments above (the issues you discovered) I was able to comprehend it, but for people new to Airflow it might be a bit confusing. Why did you not document these pitfalls/issues right in the doc? All of them are very valid points and would save time for other people.
>
> The big thing for me is exactly this:
>
> > a better strategy to process a large backfill if the desired schedule is 1 day. Processing 700+ days is going to take a lot of time and overhead when processing per month is an option. Is a duplicate of the DAG with a different interval a better choice, or are there strategies to detect this in an operator and use the output of that to specify the date window boundaries?
>
> Personally, the more I think about it, the more I am inclined to use Sid's new feature (only run latest) and store ETL timestamps outside of Airflow - so very much a traditional incremental ETL process. I am also reading that many people stay away from backfills.
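The approach Boris describes - run only the latest scheduled interval and drive incremental loads from a high-watermark timestamp stored outside the scheduler - can be sketched in plain Python. This is a hedged illustration, not Airflow's actual API; all table and column names (etl_watermark, source_events, target_events) are invented for the example.

```python
# Sketch of "only run latest" plus an externally stored ETL watermark.
# Assumptions: a SQLite store stands in for any external state backend;
# timestamps are integers for simplicity.
import sqlite3
from datetime import datetime, timedelta

def is_latest_run(execution_date, now, interval=timedelta(days=1)):
    """True only for the run whose schedule window covers `now` -
    the semantics behind the "only run latest" idea."""
    return execution_date <= now < execution_date + interval

def incremental_load(conn):
    """Load only rows newer than the persisted watermark, then advance it."""
    cur = conn.cursor()
    # Watermark persisted by the previous run (0 on the very first run).
    cur.execute("SELECT COALESCE(MAX(last_ts), 0) FROM etl_watermark")
    watermark = cur.fetchone()[0]
    cur.execute(
        "SELECT id, ts FROM source_events WHERE ts > ? ORDER BY ts",
        (watermark,))
    rows = cur.fetchall()
    if rows:
        cur.executemany("INSERT INTO target_events VALUES (?, ?)", rows)
        # Advance the watermark to the newest row processed.
        cur.execute("INSERT INTO etl_watermark (last_ts) VALUES (?)",
                    (rows[-1][1],))
    conn.commit()
    return len(rows)
```

Because state lives outside the scheduler, a rerun simply picks up where the watermark left off - the "traditional incremental ETL" Boris mentions.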
> Something else for you to consider for your document is to describe typical support scenarios once you have your pipeline running:
>
> 1) retries of tasks / DAGs (not possible to retry an entire DAG right now, I think)
> 2) reload of already loaded data (exact steps to do that)
> 3) typical troubleshooting steps
>
> On Sat, Oct 22, 2016 at 7:07 PM, Gerard Toonstra <[email protected]> wrote:
>
> > Hi all,
> >
> > So I worked out a full pipeline for a toy data warehouse on postgres:
> >
> > https://gtoonstra.github.io/etl-with-airflow/fullexample.html
> >
> > https://github.com/gtoonstra/etl-with-airflow/tree/master/examples/full-example
> >
> > It demonstrates pretty much all listed principles for ETL work except for alerting and monitoring. Just some work TBD on the DDL and a full code review on naming conventions.
> >
> > Things I ran into:
> >
> > - Issue 137: max_active_runs doesn't work after clearing tasks, though it does in the very first run.
> > - Parameters for the standard PostgresOperator are not templated, so I couldn't use the core operator.
> > - It's a good idea to specify "depends_on_past" when using sensors, otherwise sensors could exhaust available processing slots.
> > - A better strategy to process a large backfill if the desired schedule is 1 day. Processing 700+ days is going to take a lot of time and overhead when processing per month is an option. Is a duplicate of the DAG with a different interval a better choice, or are there strategies to detect this in an operator and use the output of that to specify the date window boundaries?
> > - When pooling is active, scheduling takes a lot more time. Even when the pool is 10 and the number of instances 7, it takes longer for the instances to actually run.
> >
> > Looking forward to your comments on how some approaches could be improved.
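One possible answer to the backfill question above is to compute month-sized date windows and let a single task instance process a whole window, rather than scheduling 700+ daily runs. This is a sketch under assumptions, not an Airflow feature: in a real DAG the window boundaries would be fed to the operator via templating, and the function name `month_windows` is invented here.

```python
# Chunk a backfill range [start, end) into calendar-month windows.
# The first and last windows may be partial months.
from datetime import date

def month_windows(start, end):
    """Return (window_start, window_end) pairs covering [start, end)."""
    windows = []
    cur = start
    while cur < end:
        # First day of the month following `cur`.
        if cur.month == 12:
            nxt = date(cur.year + 1, 1, 1)
        else:
            nxt = date(cur.year, cur.month + 1, 1)
        windows.append((cur, min(nxt, end)))
        cur = nxt
    return windows
```

A two-year backfill then becomes 24 window-sized runs instead of 730 daily ones, which also avoids most of the per-run scheduling overhead mentioned above.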
> > Rgds,
> >
> > Gerard
> >
> > On Wed, Oct 19, 2016 at 8:17 AM, Gerard Toonstra <[email protected]> wrote:
> >
> > > Thanks Max,
> > >
> > > I think it always helps when new people start using software to see what their issues are.
> > >
> > > Some of it was also taken from the video on best practices from Nov 2015 on this page:
> > >
> > > https://www.youtube.com/watch?v=dgaoqOZlvEA&feature=youtu.be
> > >
> > > ----
> > >
> > > I made some more progress yesterday, but ran into issue 137. I think I solved it with depends_on_past, but I'm going to rely on the LatestOnlyOperator instead (it's better) and then work out something better from there.
> > >
> > > Rgds,
> > >
> > > Gerard
> > >
> > > On Tue, Oct 18, 2016 at 6:02 PM, Maxime Beauchemin <[email protected]> wrote:
> > >
> > > > This is an amazing thread to follow! I'm really interested to watch best practices documentation emerge out of the community.
> > > >
> > > > Gerard, I enjoyed reading your docs and would love to see this grow. I've been meaning to write a series of blog posts on the subject for quite some time. It seems like you have a really good start. We could integrate this as a "Best Practices" section in our current documentation once we build consensus about the content.
> > > >
> > > > Laura, please post on this mailing list once the talk is up as a video, I'd love to watch it.
> > > >
> > > > A related best practice I'd like to write about is the idea of applying some concepts of functional programming to ETL. The idea is to use immutable datasets/datablocks systematically as sources to your computations, in ways that any task instance sources from immutable datasets that are persisted in your backend.
> > > > That allows you to satisfy the guarantee that re-running any chunk of ETL at a different point in time should lead to the exact same result. It also usually means that you need to 1) do incremental loads, and 2) "snapshot" your dimension/referential/small tables in time, to make sure that running the ETL from 26 days ago sources from the dimension snapshot as it was back then and yields the exact same result.
> > > >
> > > > Anyhow, it's a complex and important subject I should probably write about in a structured way sometime.
> > > >
> > > > Max
> > > >
> > > > On Mon, Oct 17, 2016 at 6:12 PM, Boris Tyukin <[email protected]> wrote:
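Max's snapshotting point can be illustrated with a toy sketch: each run writes the dimension table as an immutable partition keyed by its execution date, and any rerun for a past date reads exactly that partition, so it yields the same result. The dict-based "store" and the function names here are invented for illustration; a real backend would be partitioned tables or dated files.

```python
# Immutable per-date dimension snapshots: write once per execution date,
# never overwrite; reruns for a past date see the data as it was then.

def snapshot_dimension(store, execution_date, dimension_rows):
    """Persist the dimension for this run; setdefault keeps an existing
    snapshot intact if the run is repeated."""
    store.setdefault(execution_date, dict(dimension_rows))

def lookup(store, execution_date, key):
    """A task (re)run for execution_date reads that date's snapshot,
    not the current state of the dimension."""
    return store[execution_date][key]
```

If a customer's tier changes between runs, a rerun of the older date still joins against the old tier - the reproducibility guarantee Max describes.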
