There are also MonetDB (https://www.monetdb.org) and Greenplum, depending on your data size, which support columnar tables if you want to get your feet wet. If your data is actually more array-like, you might try out SciDB.
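If anyone wants to kick the tires on MonetDB from Python, the pymonetdb driver speaks standard DB-API 2.0. A minimal sketch - the hostname, credentials, database, and the "events" table below are placeholders for a local test install, not anything from this thread:

    # Quick smoke test of a local MonetDB instance via pymonetdb (DB-API 2.0).
    # Credentials, hostname, database, and the "events" table are placeholders.
    import pymonetdb

    conn = pymonetdb.connect(
        username="monetdb", password="monetdb",
        hostname="localhost", database="demo",
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM events")  # hypothetical table
    print(cur.fetchone())
    conn.close()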
Per this email thread, it almost sounds like a Slack team/Discourse for data engineering might be useful.

> On Jan 25, 2017, at 7:28 AM, Rob Goretsky <[email protected]> wrote:
>
> @Gerard - I mentioned Vertica just as one of the first examples of a system that offers columnar storage. You might actually see a significant benefit from columnar storage even with a smaller table, as small as a few GB. Columnar storage works well if you have wide fact tables with many columns and often query on just a few of those columns. The downside to columnar storage is that if you often SELECT *, or many of the columns from the table at once, it will actually be slower than if you had stored the data in traditional row-based storage. Also, updates and deletes can be slower with columnar storage, so it works best if you have wide, INSERT-only fact tables. That said, I think there are better options than Vertica on the market today for getting your feet wet with columnar storage. If AWS is an option for you, then Redshift offers this out of the box and would let you run your POC for as little as $0.25 an hour. Parquet is basically columnar storage for Hadoop. Other more traditional data warehouse vendors like Netezza and Teradata also offer columnar storage as an option.
>
>> On Wed, Jan 25, 2017 at 9:16 AM, Boris Tyukin <[email protected]> wrote:
>>
>> Max, really really nice post, and I like your style of writing - please continue sharing your experience and inspiring many of us working in more traditional environments ;) I shared your post with our leadership, and hopefully we will soon have data engineers on our team! As far as UI vs. coding, I am not sure I fully agree: if we look at software development history, we will see times when programming was the only answer and required hardcore professionals like you, but then came commercial applications that were very visual and lowered the skillset requirements. Informatica, SSIS, and others became hugely popular, and many people swear they save time if you know how to use them. I am pretty sure we will see new tools in the Big Data arena as well (AtScale is one example) that make things easier for less skilled developers and users.
>>
>> It is also good timing for me, as my company is evaluating the Informatica Big Data Management add-on (which competes with Talend Big Data) - I am not sold yet on why we would need it if we can do much more with Python, Spark, and Hive. But the key point the Informatica folks make is lowering the skill requirements for developers and leveraging existing Informatica and SQL skills. I think this is important, because this is exactly why SQL is still a huge player in the Big Data world - people love SQL, they can do a lot with SQL, and they want to use the SQL experience they've built over their careers.
>>
>> The dimensional modelling question you raised is also very interesting but very arguable. I had been thinking about it before and still have not come to believe that flat tables are the way to go. You said it yourself that there is still a place for a highly accurate (certified) enterprise-wide warehouse, and one still needs to spend a lot of time thinking about use cases and designing to support them. I am not sure I like the abundance of de-normalized tables in the Big Data world, but I do see your point about SCDs and all the pain of maintaining a traditional star schema DW.
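Rob's columnar point above is easy to demo without standing up a database: with Parquet you only pay for the columns you actually read. A rough sketch with pyarrow - the column names, row count, and file path are made up for illustration:

    # Illustrates the columnar trade-off Rob describes: selecting a few
    # columns from a wide Parquet file reads only those column chunks,
    # while a SELECT *-style read pulls every column off disk.
    # Column names, row count, and file path are invented for this demo.
    import pyarrow as pa
    import pyarrow.parquet as pq

    num_rows = 100_000
    wide_fact = pa.table({f"col_{i}": list(range(num_rows)) for i in range(30)})
    pq.write_table(wide_fact, "wide_fact.parquet")

    # Columnar-friendly access: touch 2 of the 30 columns.
    narrow_read = pq.read_table("wide_fact.parquet", columns=["col_0", "col_1"])

    # SELECT *-style access: every column gets read.
    full_read = pq.read_table("wide_fact.parquet")
    print(narrow_read.num_columns, full_read.num_columns)  # 2 30

The same shape argument is why a narrow aggregate over a wide fact table flies on a columnar engine, while SELECT * tends to disappoint.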
>> But dimensional modelling is not really about maintenance or making life easier for ETL developers - IMHO it is about structuring data to simplify business and data analytics. It is about a rigorous process for conforming data from multiple source systems. It is about data quality and trust. Finally, it is about a better-performing DW (by the nature of RDBMSs, which are very good at joining tables by foreign keys) - though that last benefit is not as relevant in Hadoop, since we can reprocess or query data more efficiently.
>>
>> Gerard, why would you do that? If you already have the skills with SQL Server and your DWH is tiny (I run a 500 GB DWH in SQL Server on a weak machine), you should be fine with SQL Server. The only issue is that you cannot support fast BI queries, but if you have an Enterprise license, you can easily dump your tables into a Tabular in-memory cube and most of your queries will run in under 2 seconds. Vertica is cool, but the learning curve is pretty steep, and it really shines on big de-normalized tables, since join performance is not that good. I work with a large healthcare vendor, and they have TB-size tables in their Vertica DB - most of them are flattened out, but they still have dimensions and facts, just fewer than you would normally have with a traditional star schema design.
>>
>> On Wed, Jan 25, 2017 at 5:57 AM, Gerard Toonstra <[email protected]> wrote:
>>
>>> You mentioned Vertica and Parquet. Is it recommended to use these newer tools even when the DWH is not Big Data size (about 150 GB)?
>>>
>>> So there are a couple of good benefits, but are there any downsides and disadvantages you have to take into account when comparing Vertica vs. SQL Server, for example?
>>>
>>> If you really recommend Vertica over SQL Server, I'm looking at doing a PoC here to see where it goes...
>>>
>>> Rgds,
>>>
>>> Gerard
>>>
>>> On Wed, Jan 25, 2017 at 12:39 AM, Rob Goretsky <[email protected]> wrote:
>>>
>>>> Maxime,
>>>> Just wanted to thank you for writing this article - much like the original articles by Jeff Hammerbacher and DJ Patil coining the term "Data Scientist", I feel this article stands as a great explanation of what the title of "Data Engineer" means today. As someone who has been working in this role since before the title existed, many of the points here rang true about how the technology and tools have evolved.
>>>>
>>>> I started my career working with graphical ETL tools (Informatica) and could never shake the feeling that I could get a lot more done, with a more maintainable set of processes, if I could just write reusable functions in any programming language and keep them in a shared library. Instead, what the GUI tools forced upon us were massive wiki documents laying out 'the 9 steps you need to follow perfectly in order to build a proper Informatica workflow' that developers would painfully follow along with, rather than being able to encapsulate the things that didn't change in one central 'function' and pass in parameters for the things that varied from the defaults.
>>>>
>>>> I also spent a lot of time early in my career trying to design data warehouse tables using the Kimball methodology, with star schemas and all dimensions extracted out to separate dimension tables.
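To make the star-schema-versus-flat-table trade-off Boris and Rob are debating a bit more concrete, here is a toy pandas sketch - the tables and columns are invented, and at warehouse scale the same contrast shows up as the engine's join cost versus scan-plus-redundancy:

    # Toy contrast between a star-schema join and a denormalized flat fact
    # table, in pandas. All tables and columns are invented for illustration.
    import pandas as pd

    dim_customer = pd.DataFrame(
        {"customer_id": [1, 2], "segment": ["retail", "wholesale"]}
    )
    fact_sales = pd.DataFrame(
        {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}
    )

    # Star schema: resolve the dimension attribute with a join at query time.
    star = fact_sales.merge(dim_customer, on="customer_id")
    print(star.groupby("segment")["amount"].sum())

    # Denormalized: the attribute is repeated in the fact table, so the same
    # question becomes a plain scan + group-by, at the cost of redundancy and
    # of rewriting history if the attribute ever changes (the SCD pain point).
    flat_sales = pd.DataFrame(
        {"segment": ["retail", "retail", "wholesale"], "amount": [10.0, 20.0, 5.0]}
    )
    print(flat_sales.groupby("segment")["amount"].sum())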
>>>> As columnar storage formats with compression became available (Vertica/Parquet/etc.), I started gravitating more towards the idea that I could just store the raw string dimension data in the fact table directly, denormalized, but it always felt like I was breaking the 'purist' rules on how to design data warehouse schemas 'the right way'. So in that regard, thanks for validating my feeling that it's OK to keep denormalized dimension data directly in fact tables - it definitely makes our queries easier to write and, as you mentioned, has the added benefit of helping you avoid all of that SCD fun!
>>>>
>>>> We're about to put Airflow into production at my company (MLB.com) for a handful of DAGs to start, so it will be running alongside our existing Informatica server, which runs 500+ workflows nightly. But I can already see the writing on the wall - it's really hard for us to find talented engineers with Informatica experience along with more general computer engineering backgrounds (many seem to have specialized purely in Informatica) - so our newer engineers come in with strong Python/SQL backgrounds and have been gravitating towards building newer jobs in Airflow...
>>>>
>>>> One item that I think deserves addition to this article is the continuing prevalence of SQL. Many technologies have changed, but SQL has persisted (pun intended?). We went through a phase for a few years where it looked like the tide was turning to MapReduce, Pig, or other languages for accessing and aggregating data. But now it seems even the "NoSQL" data stores have added SQL layers on top, and we have more SQL engines for Hadoop than I can count. SQL is easy to learn but tough to master, so to me the two main languages in any modern Data Engineer's toolbelt are SQL and a scripting language (Python/Ruby). I think it's amazing that with so much change in every aspect of how we do data warehousing, SQL has stood the test of time...
>>>>
>>>> Anyway, thanks again for writing this up - I'll definitely be sharing it with my team!
>>>>
>>>> -Rob
>>>>
>>>> On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <[email protected]> wrote:
>>>>
>>>>> Hey, I just published an article about the "Data Engineer" role in modern organizations and thought it could be of interest to this community.
>>>>>
>>>>> https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
>>>>>
>>>>> Max
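And for anyone here who, like Rob, is weighing Airflow against a GUI ETL tool: a nightly workflow really is just Python. A bare-bones sketch - the DAG id, task names, and commands are placeholders, and the imports follow the Airflow 1.x layout that was current when this thread was written:

    # Bare-bones nightly DAG in the spirit of the Informatica-to-Airflow
    # migration Rob describes. DAG id, task names, and commands are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {"owner": "data-eng", "start_date": datetime(2017, 1, 1)}

    dag = DAG(
        dag_id="nightly_fact_load",
        default_args=default_args,
        schedule_interval="@daily",
    )

    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

    extract.set_downstream(load)  # load runs only after extract succeeds

Because each task is ordinary code, the reusable pieces Rob missed in Informatica can live in a shared Python library and be parameterized per DAG.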
