Hi Gaspare,

Your willingness to contribute is much appreciated. We're short on bandwidth and on expertise in Spark solutions, so it would be very helpful to the community if you could share your knowledge by documenting your idea.
As you suggested, writing down all the solutions for handling streaming data would be a good idea. I'm very interested in your Spark + Parquet solution; could you elaborate on it in your design doc? We can discuss further based on that.

I have just one simple input: we currently use a unified SQL engine (Apache Calcite) as the query engine. We overrode its execution plans so that it fetches data directly from our cubes (or inverted index). It is not a good idea to introduce another SQL engine, especially when a query needs both data in the historical cube and streaming data. It would be preferable if data in Spark/Parquet exposed interfaces similar to those of the cube storage, so that the current SQL query engine can consume it.

On Thu, Sep 24, 2015 at 3:07 PM, Gaspare Maria <[email protected]> wrote:

> Hi,
> It depends on how you organize the data. For example, where do you store
> facts?
> If you store facts in HBase and build indexes on low-selectivity columns
> (e.g. month or day), then you will have too many gets on HBase.
> Why not use Spark Streaming and Spark DataFrames? You can save the
> latest data received via Spark Streaming as Parquet and then use a cached
> RDD to query it with SQL. This is very fast; you can join with dimension
> tables (static or created from the received data), and you can also offer
> a SQL interface via Thrift.
> Then, according to the cube refresh policies, update the cube from the
> Parquet files and remove the latest files as soon as the cube is updated.
> We should try to write a design document with all the proposed solutions,
> listing the pros and cons of each.
> Even though I am busy with customer projects, I can contribute to writing
> such a document if you want, but someone should start by writing up the
> solutions already implemented.
> Regards,
> -- gas
>
> -------- Original message --------
> From: Sarnath <[email protected]>
> Date: 24/09/2015 08:22 (GMT+01:00)
> To: [email protected]
> Subject: Re: 回复: Kylin Real time
>
> Hi,
>
> Can you share some reasons why "Inverted Index" did not work?
> Because I am trying to do precisely the same for storing cubes, in our own
> private implementation.
> Wondering - what problems are upstream?
>
> Thanks,
> Best,
> Sarnath

--
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
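
P.S. To make the "similar interfaces" point concrete, here is a minimal sketch in plain Python. It is not Kylin or Calcite code; every name in it (Storage, CubeStorage, StreamingStorage, query) is hypothetical. It only assumes that both the historical cube and the recent streaming data (e.g. rows read back from Parquet) can answer the same "partial SUM grouped by dimensions" scan, so a single query layer can merge the two without a second SQL engine.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]
GroupKey = Tuple[object, ...]


def _aggregate(rows: Iterable[Row], dims: List[str], measure: str) -> Dict[GroupKey, float]:
    """Sum `measure` over `rows`, grouped by the values of `dims`."""
    totals: Dict[GroupKey, float] = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += float(row[measure])
    return dict(totals)


class Storage(ABC):
    """Hypothetical common interface for cube and streaming storage."""

    @abstractmethod
    def scan(self, dims: List[str], measure: str) -> Dict[GroupKey, float]:
        """Return partial sums of `measure` grouped by `dims`."""


class CubeStorage(Storage):
    """Historical data, assumed to be pre-aggregated rows of a cuboid."""

    def __init__(self, preaggregated_rows: List[Row]) -> None:
        self._rows = preaggregated_rows

    def scan(self, dims: List[str], measure: str) -> Dict[GroupKey, float]:
        # Roll the stored cuboid up to the requested dimensions.
        return _aggregate(self._rows, dims, measure)


class StreamingStorage(Storage):
    """Recent data, assumed to be raw fact rows not yet built into the cube."""

    def __init__(self, raw_rows: List[Row]) -> None:
        self._rows = raw_rows

    def scan(self, dims: List[str], measure: str) -> Dict[GroupKey, float]:
        # Aggregate raw facts on the fly.
        return _aggregate(self._rows, dims, measure)


def query(storages: Iterable[Storage], dims: List[str], measure: str) -> Dict[GroupKey, float]:
    """Merge partial sums from every storage behind the shared interface."""
    merged: Dict[GroupKey, float] = defaultdict(float)
    for storage in storages:
        for key, value in storage.scan(dims, measure).items():
            merged[key] += value
    return dict(merged)


cube = CubeStorage([
    {"day": "2015-09-23", "region": "EU", "amount": 100.0},
    {"day": "2015-09-23", "region": "US", "amount": 50.0},
])
stream = StreamingStorage([
    {"day": "2015-09-24", "region": "EU", "amount": 5.0},
    {"day": "2015-09-24", "region": "EU", "amount": 7.0},
    {"day": "2015-09-24", "region": "US", "amount": 3.0},
])

result = query([cube, stream], ["region"], "amount")
print(result)
```

The merge works here only because SUM is additive across segments; measures such as distinct counts would need a different partial representation.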
