Hi Laura,

Looks very good. The first thing I had to do when I started was figure out the relevant ETL concepts, since I don't have a BI background. When I follow the tutorial and look at the examples, it's clear what airflow can do conceptually, but as soon as I want to get started on something, I have no clear idea what real-life DAGs look like. So I see myself, and probably others, staring at blank screens a lot and getting lost.
I like your approach of putting data in a common place between operators and thinking about where data resides between steps. Huge monolithic operators aren't reusable, and it makes sense to break code down into logical, generic units that are easy to configure through lambda functions or parameters.

I'm not sure the dev list is the right place for these 'user' type discussions, but let's see what they have to say about this.

Rgds,

Gerard

On Tue, Oct 18, 2016 at 3:12 AM, Boris Tyukin <bo...@boristyukin.com> wrote:

> Thanks for sharing your slides, Laura! I think I've watched all the airflow
> related slides I could find and you did a very good job - adding your
> slides to my collection :) I especially liked how you explained the
> execution date concept, but I wish you could elaborate on the backfill
> concept and on running the same dag in parallel (if you guys do this sort
> of thing) - I think this is the most confusing thing about Airflow and it
> needs a good explanation / examples.
>
> On Mon, Oct 17, 2016 at 5:19 PM, Laura Lorenz <llor...@industrydive.com>
> wrote:
>
> > Same! I actually recently gave a talk about how my company uses airflow
> > at PyData DC. The video isn't live yet, but the slides are here
> > <http://www.slideshare.net/LauraLorenz4/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow>.
> > In substance it's actually very similar to what you've written.
> >
> > I have some airflow-specific ideas about ways to write custom sensors
> > that poll job apis (pretty common for us). We do dynamic generation of
> > tasks using external metadata by embedding an API call in the DAG
> > definition file, which I'm not sure is a best practice or not...
> >
> > Anyways, if it makes sense to contribute these case studies for
> > consideration as a 'best practice', if this is the place or way to do it,
> > I'm game. I agree that the resources and thought leadership on ETL design
> > are fragmented, and think the Airflow community is fertile ground to
> > provide discussion about it.
> >
> > On Sun, Oct 16, 2016 at 6:40 PM, Boris Tyukin <bo...@boristyukin.com>
> > wrote:
> >
> > > I really look forward to it, Gerard! I've read what you wrote so far
> > > and I really liked it - please keep up the great job!
> > >
> > > I am hoping to see some best practices for the design of incremental
> > > loads and using timestamps from source database systems (not being on
> > > UTC so still confused about it in Airflow). Also practical use of
> > > subdags and dynamic generation of tasks using some external metadata
> > > (maybe describe in detail something similar to what wepay did
> > > https://wecode.wepay.com/posts/airflow-wepay)
> > >
> > > On Sun, Oct 16, 2016 at 5:23 PM, Gerard Toonstra <gtoons...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > About a year ago, I contributed the HTTPOperator/Sensor and I've been
> > > > tracking airflow since. Right now it looks like we're going to adopt
> > > > airflow at the company I'm currently working at.
> > > >
> > > > In preparation for that, I've done a bit of research into how airflow
> > > > pipelines should fit together and how important ETL principles are
> > > > covered, and I decided to write this up on a documentation site. The
> > > > airflow documentation site contains everything on how airflow works
> > > > and the constructs that you have available to build pipelines, but it
> > > > can still be a challenge for newcomers to figure out how to put those
> > > > constructs together and use them effectively.
> > > >
> > > > The articles I found online don't go into a lot of detail either.
> > > > Airflow is built around an important philosophy towards ETL and
> > > > there's a risk that newcomers simply pick up a really great tool and
> > > > start off in the wrong way when using it.
> > > >
> > > > This weekend, I set off to write some documentation to try to fill
> > > > this gap. It starts off with a generic understanding of important ETL
> > > > principles and I'm currently working on a practical step-by-step
> > > > example that adheres to these principles with DAG implementations in
> > > > airflow; i.e. showing how it can all fit together.
> > > >
> > > > You can find the current version here:
> > > >
> > > > https://gtoonstra.github.io/etl-with-airflow/index.html
> > > >
> > > > Looking forward to your comments. If you have better ideas how I can
> > > > make this contribution, don't hesitate to contact me with your
> > > > suggestions.
> > > >
> > > > Best regards,
> > > >
> > > > Gerard