This is an amazing thread to follow! I'm really interested to see
best-practices documentation emerge out of the community.

Gerard, I enjoyed reading your docs and would love to see this grow. I've
been meaning to write a series of blog posts on the subject for quite some
time. It seems like you have a really good start. We could integrate this
into our current documentation as a "Best Practices" section once we build
consensus about the content.

Laura, please post on this mailing list once the talk is up as a video, I'd
love to watch it.

A related best practice I'd like to write about is the idea of applying
some concepts of functional programming to ETL. The idea is to
systematically use immutable datasets/datablocks as the sources for your
computations, so that any task instance sources from immutable datasets
that are persisted in your backend. That lets you satisfy the guarantee
that re-running any chunk of ETL at a different point in time leads to the
exact same result. In practice it usually means that you need to 1) do
incremental loads, and 2) "snapshot" your dimension/referential/small
tables in time, so that running the ETL from 26 days ago sources from the
dimension snapshot as it was back then and yields the exact same result.
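
To make that a bit more concrete, here's a minimal sketch of the pattern,
written against the Airflow 1.x PythonOperator. The paths and dataset names
are made up for illustration; the point is simply that every source and
every target is keyed to the execution date, so re-running an old date
reads and writes exactly the same partitions:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def load_orders(ds, **kwargs):
        # Source: the fact data for exactly this execution date (incremental load).
        fact_path = "s3://warehouse/raw/orders/ds={}/".format(ds)
        # Source: the dimension table as it looked on that same date (snapshot),
        # never the "current" version, so old runs stay reproducible.
        dim_path = "s3://warehouse/snapshots/customers/ds={}/".format(ds)
        # Target: an immutable output partition keyed by execution date;
        # a rerun overwrites this partition and nothing else.
        out_path = "s3://warehouse/staging/orders_enriched/ds={}/".format(ds)
        print("joining {} with {} into {}".format(fact_path, dim_path, out_path))


    dag = DAG(
        "functional_etl_example",
        start_date=datetime(2016, 10, 1),
        schedule_interval="@daily",
    )

    load = PythonOperator(
        task_id="load_orders",
        python_callable=load_orders,
        provide_context=True,
        dag=dag,
    )

Nothing in that task ever reads "latest" state, which is what makes the
rerun guarantee possible regardless of the storage layer underneath.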

Anyhow, it's a complex and important subject I should probably write about
in a structured way sometime.

Max

On Mon, Oct 17, 2016 at 6:12 PM, Boris Tyukin <bo...@boristyukin.com> wrote:

> Thanks for sharing your slides, Laura! I think I've watched all the airflow
> related slides I could find and you did a very good job - adding your
> slides to my collection :)  I especially liked how you were explaining the
> execution date concept, but I wish you could elaborate on the backfill
> concept and on running the same dag in parallel (if you guys do this sort
> of thing) - I think that is the most confusing part of Airflow and the one
> that needs a good explanation / examples.
>
> On Mon, Oct 17, 2016 at 5:19 PM, Laura Lorenz <llor...@industrydive.com>
> wrote:
>
> > Same! I actually recently gave a talk about how my company uses airflow
> > at PyData DC. The video isn't live yet, but the slides are here
> > <http://www.slideshare.net/LauraLorenz4/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow>.
> > In substance it's actually very similar to what you've written.
> >
> > I have some airflow-specific ideas about ways to write custom sensors
> > that poll job APIs (pretty common for us). We do dynamic generation of
> > tasks using external metadata by embedding an API call in the DAG
> > definition file, which may or may not be a best practice...
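
(Chiming in inline on the sensor idea: a poke-style sensor along those lines
can stay quite small. A rough sketch follows; the job URL, the "done" status
value, and the import path are assumptions that depend on your job API and
your Airflow version.)

    import requests

    from airflow.operators.sensors import BaseSensorOperator
    from airflow.utils.decorators import apply_defaults


    class JobApiSensor(BaseSensorOperator):
        """Pokes an external job API until the job reports completion."""

        @apply_defaults
        def __init__(self, job_url, *args, **kwargs):
            super(JobApiSensor, self).__init__(*args, **kwargs)
            self.job_url = job_url

        def poke(self, context):
            # Return True to succeed; False means go back to sleep and poke again.
            response = requests.get(self.job_url)
            response.raise_for_status()
            return response.json().get("status") == "done"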
> >
> > Anyways, if it makes sense to contribute these case studies for
> > consideration as a 'best practice', and if this is the place or way to
> > do it, I'm game. I agree that the resources and thought leadership on
> > ETL design are fragmented, and think the Airflow community is fertile
> > ground to provide discussion about it.
> >
> > On Sun, Oct 16, 2016 at 6:40 PM, Boris Tyukin <bo...@boristyukin.com>
> > wrote:
> >
> > > I really look forward to it, Gerard! I've read what you wrote so far
> > > and I really liked it - please keep up the great work!
> > >
> > > I am hoping to see some best practices for the design of incremental
> > > loads and for using timestamps from source database systems (ours are
> > > not on UTC, so I'm still confused about how that works in Airflow).
> > > Also the practical use of subdags and dynamic generation of tasks
> > > using some external metadata (maybe describe in detail something
> > > similar to what wepay did:
> > > https://wecode.wepay.com/posts/airflow-wepay)
> > >
> > >
> > > On Sun, Oct 16, 2016 at 5:23 PM, Gerard Toonstra <gtoons...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > About a year ago, I contributed the HTTPOperator/Sensor and I've been
> > > > tracking airflow since. Right now it looks like we're going to adopt
> > > > airflow at the company I'm currently working at.
> > > >
> > > > In preparation for that, I've done a bit of research into how airflow
> > > > pipelines should fit together and how important ETL principles are
> > > > covered, and decided to write this up on a documentation site. The
> > > > airflow documentation site covers how airflow works and the
> > > > constructs that you have available to build pipelines, but it can
> > > > still be a challenge for newcomers to figure out how to put those
> > > > constructs together to use it effectively.
> > > >
> > > > The articles I found online don't go into a lot of detail either.
> > > > Airflow is built around an important philosophy towards ETL and
> > > > there's a risk that newcomers simply pick up a really great tool and
> > > > start off in the wrong way when using it.
> > > >
> > > >
> > > > This weekend, I set off to write some documentation to try to fill
> > > > this gap. It starts off with a generic understanding of important ETL
> > > > principles, and I'm currently working on a practical step-by-step
> > > > example that adheres to these principles with DAG implementations in
> > > > airflow; i.e. showing how it can all fit together.
> > > >
> > > > You can find the current version here:
> > > >
> > > > https://gtoonstra.github.io/etl-with-airflow/index.html
> > > >
> > > >
> > > > Looking forward to your comments. If you have better ideas about how
> > > > I can make this contribution, don't hesitate to contact me with your
> > > > suggestions.
> > > >
> > > > Best regards,
> > > >
> > > > Gerard
> > > >
> > >
> >
>
