Glad to hear the article resonated with you! I was just interviewed on a podcast on this very subject. It should be up sometime this week: https://itunes.apple.com/us/podcast/data-engineering-podcast/id1193040557
It's less structured than the article, but you can hear me babble about data engineering and say semi-outrageous things about data scientists if you have the patience to sit through it :)

I totally agree about SQL: it's the one solid constant in this ever-changing space. Screw SCDs!

Max

On Tue, Jan 24, 2017 at 3:39 PM, Rob Goretsky <[email protected]> wrote:

> Maxime,
>
> Just wanted to thank you for writing this article - much like the original articles by Jeff Hammerbacher and DJ Patil coining the term "Data Scientist", I feel this article stands as a great explanation of what the title of "Data Engineer" means today. As someone who has been working in this role since before the title existed, many of the points here rang true about how the technology and tools have evolved.
>
> I started my career working with graphical ETL tools (Informatica) and could never shake the feeling that I could get a lot more done, with a more maintainable set of processes, if I could just write reusable functions in any programming language and then keep them in a shared library. Instead, what the GUI tools forced upon us were massive Wiki documents laying out 'the 9 steps you need to follow perfectly in order to build a proper Informatica workflow', that developers would painfully need to follow along with, rather than being able to encapsulate the things that didn't change in one central 'function' to pass in parameters for the things that varied from the defaults.
>
> I also spent a lot of time early in my career trying to design data warehouse tables using the Kimball methodology with star schemas and all dimensions extracted out to separate dimension tables. As columnar storage formats with compression became available (Vertica/Parquet/etc), I started gravitating more towards the idea that I could just store the raw string dimension data in the fact table directly, denormalized, but it always felt like I was breaking the 'purist' rules on how to design data warehouse schemas 'the right way'. So in that regard, thanks for validating my feeling that it's OK to keep denormalized dimension data directly in fact tables - it definitely makes our queries easier to write, and as you mentioned, has the added benefit of helping you avoid all of that SCD fun!
>
> We're about to put Airflow into production at my company (MLB.com) for a handful of DAGs to start, so it will be running alongside our existing Informatica server running 500+ workflows nightly. But I can already see the writing on the wall - it's really hard for us to find talented engineers with Informatica experience along with more general computer engineering backgrounds (many seem to have specialized in purely Informatica) - so our newer engineers come in with strong Python/SQL backgrounds and have been gravitating towards building newer jobs in Airflow...
>
> One item that I think deserves addition to this article is the continuing prevalence of SQL. Many technologies have changed, but SQL has persisted (pun intended?). We went through a phase for a few years where it looked like the tide was turning to MapReduce, Pig, or other languages for accessing and aggregating data. But now it seems even the "NoSQL" data stores have added SQL layers on top, and we have more SQL engines for Hadoop than I can count. SQL is easy to learn but tougher to master, so to me the two main languages in any modern Data Engineer's toolbelt are SQL and a scripting language (Python/Ruby). I think it's amazing that with so much change in every aspect of how we do data warehousing, SQL has stood the test of time...
>
> Anyways, thanks again for writing this up, I'll definitely be sharing it with my team!
>
> -Rob
>
> On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <[email protected]> wrote:
>
> > Hey I just published an article about the "Data Engineer" role in modern organizations and thought it could be of interest to this community.
> >
> > https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
> >
> > Max
