Hi Gerard, I like your examples, but compared to the first sections of your document, that last section felt a bit rushed. By looking at the actual example and your comments above (the issues you discovered) I was able to follow it, but for people new to Airflow it might be confusing. Why not document these pitfalls/issues right in the doc? All of them are very valid points and would save time for other people.

The big thing for me is exactly this: a better strategy to process a large backfill when the desired schedule is 1 day. Processing 700+ days is going to take a lot of time and overhead when processing per month is an option. Is a duplicate of the DAG with a different interval a better choice, or are there strategies to detect this in an operator and use its output to specify the date window boundaries? Personally, the more I think about it, the more I am inclined to use Sid's new feature (only run latest) and store ETL timestamps outside of Airflow - so very much a traditional incremental ETL process. A rough sketch of what I mean follows.
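Purely illustrative - the etl_watermark table, the 'dwh' connection id and the initial date are all made up, and you would gate the load behind the new LatestOnlyOperator so only the most recent scheduled run does any work:

from datetime import datetime

from airflow.hooks.postgres_hook import PostgresHook


def get_load_window(target_table):
    """Return the (start, end) window to load, based on a high-water mark
    stored in the warehouse itself instead of Airflow's execution dates."""
    hook = PostgresHook(postgres_conn_id='dwh')
    row = hook.get_first(
        "SELECT last_loaded_at FROM etl_watermark WHERE table_name = %s",
        parameters=(target_table,))
    start = row[0] if row else datetime(2015, 1, 1)  # first-ever load boundary
    return start, datetime.utcnow()


def set_watermark(target_table, loaded_until):
    """Advance the high-water mark after a successful load."""
    hook = PostgresHook(postgres_conn_id='dwh')
    hook.run(
        "UPDATE etl_watermark SET last_loaded_at = %s WHERE table_name = %s",
        parameters=(loaded_until, target_table))

A single run then catches up however many days are pending, so a 700+ day backfill collapses into one (or a few) windowed loads instead of 700 task instances.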
I am also reading that many people stay away from backfills.

Something else for you to consider for your document is to describe typical support scenarios once you have your pipeline running:

1) retries of tasks / DAGs (I don't think it is possible to retry an entire DAG right now)
2) reloading already loaded data (the exact steps to do that - see the example right after this list)
3) typical troubleshooting steps
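For 2), the usual route is to clear the task instances for a date range and let the scheduler pick them up again. Flags as I understand them in the current CLI - double-check with airflow clear --help:

airflow clear my_etl_dag -s 2016-09-01 -e 2016-09-30

Add -t with a task regex to limit the blast radius. Of course this is only safe if the tasks are idempotent for their date window, which ties into Max's functional ETL point further down.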
On Sat, Oct 22, 2016 at 7:07 PM, Gerard Toonstra <[email protected]> wrote:

> Hi all,
>
> So I worked out a full pipeline for a toy data warehouse on postgres:
>
> https://gtoonstra.github.io/etl-with-airflow/fullexample.html
> https://github.com/gtoonstra/etl-with-airflow/tree/master/examples/full-example
>
> It demonstrates pretty much all listed principles for ETL work except for
> alerting and monitoring. Just some work TBD on the DDL and a full code
> review on naming conventions.
>
> Things I ran into:
> - Issue 137: max_active_runs doesn't work after clearing tasks; it does in
>   the very first run.
> - parameters for the standard PostgresOperator are not templated, so I
>   couldn't use the core operator.
> - it's a good idea to specify "depends_on_past" when using sensors,
>   otherwise sensors could exhaust the available processing slots.
> - a better strategy to process a large backfill when the desired schedule
>   is 1 day. Processing 700+ days is going to take a lot of time and
>   overhead when processing per month is an option. Is a duplicate of the
>   DAG with a different interval a better choice, or are there strategies
>   to detect this in an operator and use its output to specify the date
>   window boundaries?
> - when pooling is active, scheduling takes a lot more time. Even when the
>   pool is 10 and the number of instances is 7, it takes longer for the
>   instances to actually run.
>
> Looking forward to your comments on how some approaches could be improved.
>
> Rgds,
>
> Gerard
>
> On Wed, Oct 19, 2016 at 8:17 AM, Gerard Toonstra <[email protected]> wrote:
>
> > Thanks Max,
> >
> > I think it always helps when new people start using software to see
> > what their issues are.
> >
> > Some of it was also taken from the video on best practices in Nov 2015
> > on this page:
> >
> > https://www.youtube.com/watch?v=dgaoqOZlvEA&feature=youtu.be
> >
> > ----
> >
> > I made some more progress yesterday, but ran into issue 137. I think I
> > solved it with depends_on_past, but I'm going to rely on the
> > LatestOnlyOperator instead (it's better) and then work out something
> > better from there.
> >
> > Rgds,
> >
> > Gerard
> >
> > On Tue, Oct 18, 2016 at 6:02 PM, Maxime Beauchemin
> > <[email protected]> wrote:
> >
> >> This is an amazing thread to follow! I'm really interested to watch
> >> best practices documentation emerge out of the community.
> >>
> >> Gerard, I enjoyed reading your docs and would love to see this grow.
> >> I've been meaning to write a series of blog posts on the subject for
> >> quite some time. It seems like you have a really good start. We could
> >> integrate this as a "Best Practice" section into our current
> >> documentation once we build consensus about the content.
> >>
> >> Laura, please post on this mailing list once the talk is up as a
> >> video, I'd love to watch it.
> >>
> >> A related best practice I'd like to write about is the idea of applying
> >> some concepts of functional programming to ETL. The idea is to use
> >> immutable datasets/datablocks systematically as sources for your
> >> computations, in such a way that any task instance sources from
> >> immutable datasets that are persisted in your backend. That lets you
> >> satisfy the guarantee that re-running any chunk of ETL at a different
> >> point in time leads to the exact same result. It also usually means
> >> that you need to 1) do incremental loads, and 2) "snapshot" your
> >> dimension/referential/small tables in time, to make sure that running
> >> the ETL from 26 days ago sources from the dimension snapshot as it was
> >> back then and yields the exact same result.
> >>
> >> Anyhow, it's a complex and important subject I should probably write
> >> about in a structured way sometime.
> >>
> >> Max
> >>
> >> On Mon, Oct 17, 2016 at 6:12 PM, Boris Tyukin <[email protected]>
> >> wrote:
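PS - two more thoughts on the quoted thread. On Gerard's point about parameters not being templated on the core PostgresOperator: a thin subclass that extends template_fields should do it, since Airflow renders dicts in templated fields. A sketch (module paths and table names are mine, and may differ between versions):

from datetime import datetime

from airflow.models import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG('postgres_templating_example', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')


class TemplatedPostgresOperator(PostgresOperator):
    # The stock operator only templates 'sql'; adding 'parameters' makes
    # Jinja render values such as {{ ds }} inside the parameters dict.
    template_fields = ('sql', 'parameters')


load = TemplatedPostgresOperator(
    task_id='load_orders',
    postgres_conn_id='dwh',  # hypothetical connection id
    sql="INSERT INTO fact_orders "
        "SELECT * FROM staging_orders WHERE order_date = %(day)s",
    parameters={'day': '{{ ds }}'},
    dag=dag)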

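And to make Max's functional ETL idea concrete: every run writes exactly one partition keyed on the execution date, and the dimension is snapshotted once per day, so a rerun of an old date reads the dimension as it was back then. All table and column names below are invented, and the CREATE TABLE IF NOT EXISTS ... AS form needs Postgres 9.5+:

from datetime import datetime

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('functional_etl_example', start_date=datetime(2016, 1, 1),
          schedule_interval='@daily')


def load_day(ds, **kwargs):
    """Idempotent load for one execution date (ds, e.g. '2016-10-22')."""
    hook = PostgresHook(postgres_conn_id='dwh')
    snap = 'dim_customer_snap_%s' % ds.replace('-', '')
    # 1) Snapshot the small dimension table once per date; a rerun keeps
    #    the existing snapshot, preserving the point-in-time view.
    hook.run("CREATE TABLE IF NOT EXISTS %s AS SELECT * FROM dim_customer"
             % snap)
    # 2) Delete-then-insert the fact partition for this date, sourcing
    #    the dimension snapshot, so reruns yield the exact same result.
    hook.run("DELETE FROM fact_orders WHERE order_date = %s",
             parameters=(ds,))
    hook.run("INSERT INTO fact_orders "
             "SELECT o.order_date, o.amount, c.customer_key "
             "FROM staging_orders o "
             "JOIN %s c ON c.customer_id = o.customer_id "
             "WHERE o.order_date = %%s" % snap,
             parameters=(ds,))


load = PythonOperator(task_id='load_day', python_callable=load_day,
                      provide_context=True, dag=dag)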