Maxime,

Just wanted to thank you for writing this article. Much like the original articles by Jeff Hammerbacher and DJ Patil coining the term "Data Scientist," I feel this piece stands as a great explanation of what the title "Data Engineer" means today. As someone who was working in this role before the title existed, many of the points here rang true about how the technology and tools have evolved.
I started my career working with graphical ETL tools (Informatica) and could never shake the feeling that I could get a lot more done, with a more maintainable set of processes, if I could just write reusable functions in a programming language and keep them in a shared library. Instead, the GUI tools forced upon us massive wiki documents laying out "the 9 steps you need to follow perfectly in order to build a proper Informatica workflow," which developers had to follow along painstakingly, rather than encapsulating the things that didn't change in one central "function" and passing in parameters for the things that varied from the defaults.

I also spent a lot of time early in my career designing data warehouse tables using the Kimball methodology: star schemas, with all dimensions extracted out to separate dimension tables. As columnar storage formats with compression became available (Vertica, Parquet, etc.), I started gravitating toward the idea that I could just store the raw string dimension data in the fact table directly, denormalized, but it always felt like I was breaking the "purist" rules on how to design data warehouse schemas "the right way." So in that regard, thanks for validating my feeling that it's OK to keep denormalized dimension data directly in fact tables. It definitely makes our queries easier to write, and as you mentioned, it has the added benefit of helping you avoid all of that SCD fun!

We're about to put Airflow into production at my company (MLB.com) with a handful of DAGs to start, so it will be running alongside our existing Informatica server, which runs 500+ workflows nightly.
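To make the denormalization point concrete, here's a minimal sketch using SQLite with made-up table and column names (a baseball-flavored example; none of this is from MLB.com's actual schema). The dimension attributes live as raw strings on the fact row, so aggregating never requires a dimension join, and under a compressed columnar format the repeated strings cost little:

```python
import sqlite3

# Hypothetical denormalized fact table: dimension attributes (team, venue)
# are stored as raw strings on each fact row, instead of integer keys
# joined out to separate dim_team / dim_venue tables as a strict star
# schema would require.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_ticket_sales (
        game_date TEXT,
        home_team TEXT,   -- would be a dim_team key in a strict star schema
        venue     TEXT,   -- would be a dim_venue key
        tickets   INTEGER
    )
""")
conn.executemany(
    "INSERT INTO fact_ticket_sales VALUES (?, ?, ?, ?)",
    [
        ("2017-04-03", "Yankees", "Yankee Stadium", 46000),
        ("2017-04-03", "Cubs", "Wrigley Field", 39000),
        ("2017-04-04", "Yankees", "Yankee Stadium", 41000),
    ],
)

# No dimension joins needed: filter and aggregate directly on the strings.
totals = conn.execute("""
    SELECT home_team, SUM(tickets) AS total_tickets
    FROM fact_ticket_sales
    GROUP BY home_team
    ORDER BY home_team
""").fetchall()
print(totals)  # [('Cubs', 39000), ('Yankees', 87000)]
```

The query stays a one-table GROUP BY, and because the dimension value is captured on the row at load time, there's no slowly-changing-dimension bookkeeping to get a historically accurate answer.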
But I can already see the writing on the wall: it's really hard for us to find talented engineers with both Informatica experience and a more general computer engineering background (many seem to have specialized purely in Informatica), so our newer engineers come in with strong Python/SQL backgrounds and have been gravitating toward building newer jobs in Airflow.

One item that I think deserves a mention in this article is the continuing prevalence of SQL. Many technologies have changed, but SQL has persisted (pun intended?). We went through a phase for a few years where it looked like the tide was turning to MapReduce, Pig, or other languages for accessing and aggregating data. But now it seems even the "NoSQL" data stores have added SQL layers on top, and we have more SQL engines for Hadoop than I can count. SQL is easy to learn but tougher to master, so to me the two main languages in any modern Data Engineer's toolbelt are SQL and a scripting language (Python/Ruby). I think it's amazing that, with so much change in every aspect of how we do data warehousing, SQL has stood the test of time.

Anyways, thanks again for writing this up. I'll definitely be sharing it with my team!

-Rob

On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <[email protected]> wrote:
> Hey I just published an article about the "Data Engineer" role in modern
> organizations and thought it could be of interest to this community.
>
> https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
>
> Max
