Need advice for kylin newbie

Vikram Kone Fri, 27 Feb 2015 08:56:00 -0800

Hi,
I'm a newbie when it comes to Kylin and the Hadoop ecosystem in general. Our
team has been predominantly a Microsoft shop that uses the MS stack for most
of its BI needs. So we are talking SQL Server for storing relational data
and SQL Server Analysis Services for building MOLAP cubes for sub-second
query analysis.
Lately, we have been hitting degradation in our cube query response times
as our data sizes grew considerably over the past year. We are talking fact
tables in the 10-100 billion row range and a few dimensions in the tens to
hundreds of millions of rows. We tried vertically scaling up our SSAS
server but queries are still taking a few minutes. In light of this, I was
entrusted with the task of figuring out an open source solution that would
scale to our current and future needs for data analysis.
I looked at a bunch of open source tools like Apache Drill, Druid, AtScale,
Spark, Storm, Kylin etc. and settled on exploring Kylin as the first step,
given its recent rise in popularity and the growing ecosystem around it.
I started to build out a POC for our MOLAP cubes using Kylin with HDFS/Hive
as the data source, to see how it scales for our queries/measures in real
time with real data. The setup has been a nightmare so far. Configuration
of the cluster takes too long. I tried the Docker version and it fails with
cryptic errors. Then I tried installing it using the build from root option
on a Hadoop cluster and ran into more issues, this time related to cube
building. Same with the binary package installation. It's just taking too
long to set up. There should be an easier way to do this :(
Roughly, these are the requirements for our team:
1. Should be able to create facts, dimensions and measures from our data
sets easily.
2. Cubes should be queryable from Excel and Tableau.
3. Easily scale out by adding new nodes when data grows.
4. Very little maintenance and highly stable for production-level workloads.
5. Sub-second query latencies for COUNT DISTINCT measures (since the majority
of our expensive measures are of this type). We are OK with approximate
distinct counts for better performance.


So given these requirements, is Kylin the right solution to replace our
on-premise MOLAP cubes? As long as our users can pivot/slice & dice the
measures quickly from client tools like Excel and Tableau by dragging and
dropping dimensions into rows/columns w/o the need to join to the fact table,
we are OK with however the data is laid out. It doesn't have to be a cube. It
can be a flat file in HDFS for all we care. I would love to chat with
someone who has successfully done this kind of migration from SSAS OLAP cubes
to Kylin in their team or company and learn about the pros and cons before I
spend more time configuring this stuff.

This is it for now. Looking forward to a great discussion.

P.S. We have decided on using Azure as our managed Hadoop system in the
cloud.
