You mentioned Vertica and Parquet. Is it recommended to use these newer tools even when the DWH is not Big Data size (about 150 GB)?
So there are a couple of good benefits, but are there any downsides or disadvantages you have to take into account when comparing Vertica with SQL Server, for example? If you really recommend Vertica over SQL Server, I'm looking at doing a PoC here to see where it goes...

Rgds,

Gerard

On Wed, Jan 25, 2017 at 12:39 AM, Rob Goretsky <[email protected]> wrote:

> Maxime,
> Just wanted to thank you for writing this article - much like the original
> articles by Jeff Hammerbacher and DJ Patil coining the term "Data
> Scientist", I feel this article stands as a great explanation of what the
> title of "Data Engineer" means today. As someone who has been working in
> this role since before the title existed, many of the points here rang true
> about how the technology and tools have evolved.
>
> I started my career working with graphical ETL tools (Informatica) and
> could never shake the feeling that I could get a lot more done, with a more
> maintainable set of processes, if I could just write reusable functions in
> any programming language and then keep them in a shared library. Instead,
> what the GUI tools forced upon us were massive Wiki documents laying out
> 'the 9 steps you need to follow perfectly in order to build a proper
> Informatica workflow', that developers would painfully need to follow
> along with, rather than being able to encapsulate the things that didn't
> change in one central 'function' to pass in parameters for the things that
> varied from the defaults.
>
> I also spent a lot of time early in my career trying to design data
> warehouse tables using the Kimball methodology with star schemas and all
> dimensions extracted out to separate dimension tables.
> As columnar storage formats with compression became available
> (Vertica/Parquet/etc), I started gravitating more towards the idea that I
> could just store the raw string dimension data in the fact table directly,
> denormalized, but it always felt like I was breaking the 'purist' rules on
> how to design data warehouse schemas 'the right way'. So in that regard,
> thanks for validating my feeling that it's OK to keep denormalized
> dimension data directly in fact tables - it definitely makes our queries
> easier to write, and as you mentioned, has the added benefit of helping you
> avoid all of that SCD fun!
>
> We're about to put Airflow into production at my company (MLB.com) for a
> handful of DAGs to start, so it will be running alongside our existing
> Informatica server running 500+ workflows nightly. But I can already see
> the writing on the wall - it's really hard for us to find talented
> engineers with Informatica experience along with more general computer
> engineering backgrounds (many seem to have specialized in purely
> Informatica) - so our newer engineers come in with strong Python/SQL
> backgrounds and have been gravitating towards building newer jobs in
> Airflow...
>
> One item that I think deserves addition to this article is the continuing
> prevalence of SQL. Many technologies have changed, but SQL has persisted
> (pun intended?). We went through a phase for a few years where it looked
> like the tide was turning to MapReduce, Pig, or other languages for
> accessing and aggregating data. But now it seems even the "NoSQL" data
> stores have added SQL layers on top, and we have more SQL engines for
> Hadoop than I can count. SQL is easy to learn but tougher to master, so
> to me the two main languages in any modern Data Engineer's toolbelt are SQL
> and a scripting language (Python/Ruby). I think it's amazing that with
> so much change in every aspect of how we do data warehousing, SQL has stood
> the test of time...
>
> Anyways, thanks again for writing this up, I'll definitely be sharing it
> with my team!
>
> -Rob
>
> On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <[email protected]> wrote:
>
> > Hey I just published an article about the "Data Engineer" role in modern
> > organizations and thought it could be of interest to this community.
> >
> > https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
> >
> > Max
