There are also MonetDB (https://www.monetdb.org) and Greenplum, depending on your data size, which support columnar tables if you want to get your feet wet. If your data is actually more array-like, you might try out SciDB.
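If anyone wants to kick the tires on MonetDB from Python, the pymonetdb driver speaks standard DB-API 2.0. A minimal sketch - the hostname, credentials, database, and the "events" table below are placeholders for a local test install, not anything from this thread:

    # Quick smoke test of a local MonetDB instance via pymonetdb (DB-API 2.0).
    # Credentials, hostname, database, and the "events" table are placeholders.
    import pymonetdb

    conn = pymonetdb.connect(
        username="monetdb", password="monetdb",
        hostname="localhost", database="demo",
    )
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM events")  # hypothetical table
    print(cur.fetchone())
    conn.close()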
Per this email thread, it almost sounds like a Slack team/Discourse for data engineering might be useful.

> On Jan 25, 2017, at 7:28 AM, Rob Goretsky <[email protected]> wrote:
>
> @Gerard - I mentioned Vertica just as one of the first examples of a system that offers columnar storage. You might actually see a significant benefit from columnar storage even with a smaller table, as small as a few GB. Columnar storage works well if you have wide fact tables with many columns and often query on just a few of those columns. The downside to columnar storage is that if you often SELECT *, or many of the columns from the table at once, it will actually be slower than if you had stored the data in traditional row-based storage. Also, updates and deletes can be slower with columnar storage, so it works best if you have wide, INSERT-only fact tables. That said, I think there are better options than Vertica on the market today for getting your feet wet with columnar storage. If AWS is an option for you, then Redshift offers this out of the box and would let you run your POC for as little as $0.25 an hour. Parquet is basically columnar storage for Hadoop. Other more traditional data warehouse vendors like Netezza and Teradata also offer columnar storage as an option.
>
>> On Wed, Jan 25, 2017 at 9:16 AM, Boris Tyukin <[email protected]> wrote:
>>
>> Max, really really nice post, and I like your style of writing - please continue sharing your experience and inspiring many of us working in more traditional environments ;) I shared your post with our leadership, and hopefully we will soon have data engineers on our team! As far as UI vs. coding, I am not sure I fully agree: if we look at software development history, we will see times when programming was the only answer and required hardcore professionals like you, but then came commercial applications that were very visual and lowered the skillset requirements. Informatica, SSIS, and others became hugely popular, and many people swear they save time if you know how to use them. I am pretty sure we will see new tools in the Big Data arena as well (AtScale is one example) that make things easier for less skilled developers and users.
>>
>> It is also good timing for me, as my company is evaluating the Informatica Big Data Management add-on (which competes with Talend Big Data) - I am not sold yet on why we would need it if we can do much more with Python, Spark, and Hive. But the key point the Informatica folks make is lowering the skill requirements for developers and leveraging existing Informatica and SQL skills. I think this is important, because this is exactly why SQL is still a huge player in the Big Data world - people love SQL, they can do a lot with SQL, and they want to use the SQL experience they've built over their careers.
>>
>> The dimensional modelling question you raised is also very interesting but very arguable. I had been thinking about it before and still have not come to believe that flat tables are the way to go. You said it yourself that there is still a place for a highly accurate (certified) enterprise-wide warehouse, and one still needs to spend a lot of time thinking about use cases and designing to support them. I am not sure I like the abundance of de-normalized tables in the Big Data world, but I do see your point about SCDs and all the pain of maintaining a traditional star schema DW.
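Rob's columnar point above is easy to demo without standing up a database: with Parquet you only pay for the columns you actually read. A rough sketch with pyarrow - the column names, row count, and file path are made up for illustration:

    # Illustrates the columnar trade-off Rob describes: selecting a few
    # columns from a wide Parquet file reads only those column chunks,
    # while a SELECT *-style read pulls every column off disk.
    # Column names, row count, and file path are invented for this demo.
    import pyarrow as pa
    import pyarrow.parquet as pq

    num_rows = 100_000
    wide_fact = pa.table({f"col_{i}": list(range(num_rows)) for i in range(30)})
    pq.write_table(wide_fact, "wide_fact.parquet")

    # Columnar-friendly access: touch 2 of the 30 columns.
    narrow_read = pq.read_table("wide_fact.parquet", columns=["col_0", "col_1"])

    # SELECT *-style access: every column gets read.
    full_read = pq.read_table("wide_fact.parquet")
    print(narrow_read.num_columns, full_read.num_columns)  # 2 30

The same shape argument is why a narrow aggregate over a wide fact table flies on a columnar engine, while SELECT * tends to disappoint.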
>> But dimensional modelling is not really about maintenance or making life easier for ETL developers - IMHO it is about structuring data to simplify business and data analytics. It is about a rigorous process for conforming data from multiple source systems. It is about data quality and trust. Finally, it is about a better-performing DW (by the nature of RDBMSs, which are very good at joining tables by foreign keys) - though that last benefit is not as relevant in Hadoop, since we can reprocess or query data more efficiently.
>>
>> Gerard, why would you do that? If you already have the skills with SQL Server and your DWH is tiny (I run a 500 GB DWH in SQL Server on a weak machine), you should be fine with SQL Server. The only issue is that you cannot support fast BI queries, but if you have an Enterprise license, you can easily dump your tables into a Tabular in-memory cube and most of your queries will run in under 2 seconds. Vertica is cool, but the learning curve is pretty steep, and it really shines on big de-normalized tables, since join performance is not that good. I work with a large healthcare vendor, and they have TB-size tables in their Vertica DB - most of them are flattened out, but they still have dimensions and facts, just fewer than you would normally have with a traditional star schema design.
>>
>> On Wed, Jan 25, 2017 at 5:57 AM, Gerard Toonstra <[email protected]> wrote:
>>
>>> You mentioned Vertica and Parquet. Is it recommended to use these newer tools even when the DWH is not Big Data size (about 150 GB)?
>>>
>>> So there are a couple of good benefits, but are there any downsides and disadvantages you have to take into account when comparing Vertica vs. SQL Server, for example?
>>>
>>> If you really recommend Vertica over SQL Server, I'm looking at doing a PoC here to see where it goes...
>>>
>>> Rgds,
>>>
>>> Gerard
>>>
>>> On Wed, Jan 25, 2017 at 12:39 AM, Rob Goretsky <[email protected]> wrote:
>>>
>>>> Maxime,
>>>> Just wanted to thank you for writing this article - much like the original articles by Jeff Hammerbacher and DJ Patil coining the term "Data Scientist", I feel this article stands as a great explanation of what the title of "Data Engineer" means today. As someone who has been working in this role since before the title existed, many of the points here rang true about how the technology and tools have evolved.
>>>>
>>>> I started my career working with graphical ETL tools (Informatica) and could never shake the feeling that I could get a lot more done, with a more maintainable set of processes, if I could just write reusable functions in any programming language and keep them in a shared library. Instead, what the GUI tools forced upon us were massive wiki documents laying out 'the 9 steps you need to follow perfectly in order to build a proper Informatica workflow' that developers would painfully follow along with, rather than being able to encapsulate the things that didn't change in one central 'function' and pass in parameters for the things that varied from the defaults.
>>>>
>>>> I also spent a lot of time early in my career trying to design data warehouse tables using the Kimball methodology, with star schemas and all dimensions extracted out to separate dimension tables.
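To make the star-schema-versus-flat-table trade-off Boris and Rob are debating a bit more concrete, here is a toy pandas sketch - the tables and columns are invented, and at warehouse scale the same contrast shows up as the engine's join cost versus scan-plus-redundancy:

    # Toy contrast between a star-schema join and a denormalized flat fact
    # table, in pandas. All tables and columns are invented for illustration.
    import pandas as pd

    dim_customer = pd.DataFrame(
        {"customer_id": [1, 2], "segment": ["retail", "wholesale"]}
    )
    fact_sales = pd.DataFrame(
        {"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]}
    )

    # Star schema: resolve the dimension attribute with a join at query time.
    star = fact_sales.merge(dim_customer, on="customer_id")
    print(star.groupby("segment")["amount"].sum())

    # Denormalized: the attribute is repeated in the fact table, so the same
    # question becomes a plain scan + group-by, at the cost of redundancy and
    # of rewriting history if the attribute ever changes (the SCD pain point).
    flat_sales = pd.DataFrame(
        {"segment": ["retail", "retail", "wholesale"], "amount": [10.0, 20.0, 5.0]}
    )
    print(flat_sales.groupby("segment")["amount"].sum())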
>>>> As columnar storage formats with compression became available (Vertica/Parquet/etc.), I started gravitating more towards the idea that I could just store the raw string dimension data in the fact table directly, denormalized, but it always felt like I was breaking the 'purist' rules on how to design data warehouse schemas 'the right way'. So in that regard, thanks for validating my feeling that it's OK to keep denormalized dimension data directly in fact tables - it definitely makes our queries easier to write and, as you mentioned, has the added benefit of helping you avoid all of that SCD fun!
>>>>
>>>> We're about to put Airflow into production at my company (MLB.com) for a handful of DAGs to start, so it will be running alongside our existing Informatica server, which runs 500+ workflows nightly. But I can already see the writing on the wall - it's really hard for us to find talented engineers with Informatica experience along with more general computer engineering backgrounds (many seem to have specialized purely in Informatica) - so our newer engineers come in with strong Python/SQL backgrounds and have been gravitating towards building newer jobs in Airflow...
>>>>
>>>> One item that I think deserves addition to this article is the continuing prevalence of SQL. Many technologies have changed, but SQL has persisted (pun intended?). We went through a phase for a few years where it looked like the tide was turning to MapReduce, Pig, or other languages for accessing and aggregating data. But now it seems even the "NoSQL" data stores have added SQL layers on top, and we have more SQL engines for Hadoop than I can count. SQL is easy to learn but tough to master, so to me the two main languages in any modern Data Engineer's toolbelt are SQL and a scripting language (Python/Ruby). I think it's amazing that with so much change in every aspect of how we do data warehousing, SQL has stood the test of time...
>>>>
>>>> Anyway, thanks again for writing this up - I'll definitely be sharing it with my team!
>>>>
>>>> -Rob
>>>>
>>>> On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <[email protected]> wrote:
>>>>
>>>>> Hey, I just published an article about the "Data Engineer" role in modern organizations and thought it could be of interest to this community.
>>>>>
>>>>> https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
>>>>>
>>>>> Max
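And for anyone here who, like Rob, is weighing Airflow against a GUI ETL tool: a nightly workflow really is just Python. A bare-bones sketch - the DAG id, task names, and commands are placeholders, and the imports follow the Airflow 1.x layout that was current when this thread was written:

    # Bare-bones nightly DAG in the spirit of the Informatica-to-Airflow
    # migration Rob describes. DAG id, task names, and commands are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {"owner": "data-eng", "start_date": datetime(2017, 1, 1)}

    dag = DAG(
        dag_id="nightly_fact_load",
        default_args=default_args,
        schedule_interval="@daily",
    )

    extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
    load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

    extract.set_downstream(load)  # load runs only after extract succeeds

Because each task is ordinary code, the reusable pieces Rob missed in Informatica can live in a shared Python library and be parameterized per DAG.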
