Max, really nice post and I like your style of writing - please
continue sharing your experience and inspiring many of us working in more
traditional environments ;) I shared your post with our leadership, and
hopefully we will have data engineers on our team soon! As far as UI vs.
coding, I am not sure I fully agree: if we look at software development
history, we will see periods when programming was the only answer and
required hardcore professionals like you, followed by commercial
applications that were very visual and lowered the skill set required.
Informatica, SSIS and others became hugely popular, and many people swear
they save time if you know how to use them. I am pretty sure we will see
new tools in the Big Data arena as well (AtScale is one example) that make
things easier for less skilled developers and users.

It is also good timing for me, as my company is evaluating the Informatica
Big Data Management add-on (which competes with Talend Big Data). I am not
sold yet on why we would need it if we can do much more with Python, Spark
and Hive. But the key point the Informatica folks make is that it lowers
the skill requirements for developers and leverages their existing
Informatica and SQL experience. I think this is important, because it is
exactly why SQL is still a huge player in the Big Data world - people love
SQL, they can do a lot with SQL, and they want to use the SQL experience
they've built over their careers.
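
To illustrate the point with a made-up example (the table and column names
are invented): the same aggregation an analyst would write against SQL
Server runs nearly verbatim on Hive, so the experience carries over
directly.

    -- standard SQL that works as-is in HiveQL
    SELECT region,
           SUM(amount) AS total_sales
    FROM sales
    WHERE sale_date >= '2017-01-01'
    GROUP BY region
    ORDER BY total_sales DESC;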

The dimensional modelling question you raised is also very interesting,
but very arguable. I was thinking about it before and still have not come
to believe that flat tables are the way to go. You said it yourself that
there is still a place for a highly accurate (certified) enterprise-wide
warehouse, and one still needs to spend a lot of time thinking about use
cases and designing to support them. I am not sure I like the abundance of
denormalized tables in the Big Data world, but I do see your point about
SCDs and all the pain of maintaining a traditional star schema DW.
Dimensional modelling, though, is not really about maintenance or making
life easier for ETL developers - IMHO it is about structuring data to
simplify business and data analytics. It is about a rigorous process for
conforming data from multiple source systems. It is about data quality and
trust. Finally, it is about a better-performing DW (by the nature of
RDBMSs, which are very good at joining tables on foreign keys) - though
that last benefit is not relevant in Hadoop, since we can reprocess or
query data more efficiently.
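
To make the trade-off concrete (all table and column names here are
invented for illustration): the same business question against a star
schema versus a flattened table looks like this - the star version needs a
join but reuses one conformed customer dimension everywhere, while the
flat version is simpler to write but repeats the attributes on every row.

    -- star schema: fact joined to a conformed dimension
    SELECT d.customer_segment,
           SUM(f.amount) AS revenue
    FROM fact_orders f
    JOIN dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.customer_segment;

    -- denormalized: the attribute is stored directly on the fact table
    SELECT customer_segment,
           SUM(amount) AS revenue
    FROM orders_flat
    GROUP BY customer_segment;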

Gerard, why would you do that? If you already have the skills with SQL
Server and your DWH is tiny (I run a 500 GB DWH in SQL Server on a weak
machine), you should be fine with SQL Server. The only issue is that it
cannot support fast BI queries on its own. But if you have an Enterprise
license, you can easily dump your tables into an in-memory Tabular cube,
and most of your queries will run in under 2 seconds. Vertica is cool, but
the learning curve is pretty steep, and it really shines on big
denormalized tables, as join performance is not that good. I work with a
large healthcare vendor, and they have TB-sized tables in their Vertica
DB - most of them are flattened out, but they still have dimensions and
facts, just fewer than you would normally have with a traditional star
schema design.
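
A rough sketch of what I mean by "flattened out but still with dimensions"
(every name below is invented): the small, low-cardinality descriptive
attributes get denormalized onto the fact itself, while the large, slowly
changing entities keep their own dimension tables.

    -- wide fact table: small attributes folded in, big entities kept as dims
    CREATE TABLE fact_claims (
        claim_id      BIGINT,
        provider_key  INT,           -- still a real dimension (large, slowly changing)
        patient_key   INT,           -- still a real dimension
        claim_status  VARCHAR(20),   -- denormalized; no more dim_status
        plan_type     VARCHAR(30),   -- denormalized; no more dim_plan
        service_date  DATE,
        amount        NUMERIC(12,2)
    );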



On Wed, Jan 25, 2017 at 5:57 AM, Gerard Toonstra <[email protected]>
wrote:

> You mentioned Vertica and Parquet. Is it recommended to use these newer
> tools even when the DWH is not Big Data sized (about 150 GB)?
>
> So there are a couple of good benefits, but are there any downsides or
> disadvantages you have to take into account when comparing Vertica vs.
> SQL Server, for example?
>
> If you really recommend Vertica over SQL Server, I'm looking at doing a PoC
> here to see where it goes...
>
> Rgds,
>
> Gerard
>
>
> On Wed, Jan 25, 2017 at 12:39 AM, Rob Goretsky <[email protected]>
> wrote:
>
> > Maxime,
> > Just wanted to thank you for writing this article - much like the
> > original articles by Jeff Hammerbacher and DJ Patil coining the term
> > "Data Scientist", I feel this article stands as a great explanation of
> > what the title of "Data Engineer" means today. As someone who has been
> > working in this role since before the title existed, many of the points
> > here rang true about how the technology and tools have evolved.
> >
> > I started my career working with graphical ETL tools (Informatica) and
> > could never shake the feeling that I could get a lot more done, with a
> > more maintainable set of processes, if I could just write reusable
> > functions in any programming language and keep them in a shared
> > library. Instead, what the GUI tools forced upon us were massive Wiki
> > documents laying out 'the 9 steps you need to follow perfectly in order
> > to build a proper Informatica workflow', which developers would
> > painfully follow along with, rather than being able to encapsulate the
> > things that didn't change in one central 'function' and pass in
> > parameters for the things that varied from the defaults.
> >
> > I also spent a lot of time early in my career trying to design data
> > warehouse tables using the Kimball methodology, with star schemas and
> > all dimensions extracted out to separate dimension tables. As columnar
> > storage formats with compression became available (Vertica, Parquet,
> > etc.), I started gravitating more towards the idea that I could just
> > store the raw string dimension data in the fact table directly,
> > denormalized, but it always felt like I was breaking the 'purist' rules
> > on how to design data warehouse schemas 'the right way'. So in that
> > regard, thanks for validating my feeling that it's OK to keep
> > denormalized dimension data directly in fact tables - it definitely
> > makes our queries easier to write and, as you mentioned, has the added
> > benefit of helping you avoid all of that SCD fun!
> >
> > We're about to put Airflow into production at my company (MLB.com) for
> > a handful of DAGs to start, so it will be running alongside our
> > existing Informatica server, which runs 500+ workflows nightly. But I
> > can already see the writing on the wall - it's really hard for us to
> > find talented engineers with Informatica experience along with more
> > general computer engineering backgrounds (many seem to have specialized
> > purely in Informatica) - so our newer engineers come in with strong
> > Python/SQL backgrounds and have been gravitating towards building newer
> > jobs in Airflow.
> >
> > One item that I think deserves addition to this article is the
> > continuing prevalence of SQL. Many technologies have changed, but SQL
> > has persisted (pun intended?). We went through a phase for a few years
> > where it looked like the tide was turning to MapReduce, Pig, or other
> > languages for accessing and aggregating data. But now it seems even the
> > "NoSQL" data stores have added SQL layers on top, and we have more SQL
> > engines for Hadoop than I can count. SQL is easy to learn but tougher
> > to master, so to me the two main languages in any modern Data
> > Engineer's toolbelt are SQL and a scripting language (Python/Ruby). I
> > think it's amazing that, with so much change in every aspect of how we
> > do data warehousing, SQL has stood the test of time.
> >
> > Anyways, thanks again for writing this up, I'll definitely be sharing it
> > with my team!
> >
> > -Rob
> >
> > On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <
> > [email protected]> wrote:
> >
> > > Hey I just published an article about the "Data Engineer" role in
> > > modern organizations and thought it could be of interest to this
> > > community.
> > >
> > > https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
> > >
> > > Max
> > >
> >
>
