Hi Gaspare,

We really appreciate that you're willing to contribute. We're short on
bandwidth and on expertise in Spark solutions, so it would be very
helpful to the community if you could share your knowledge by
documenting your idea.

As you suggested, I think writing down all the solutions (for addressing
the streaming data issue) would be a good idea. I'm very interested in
your Spark + Parquet solution. Could you elaborate on it in your design
doc? We can discuss further based on that.

I just have one simple input: currently we use a unified SQL engine
(Apache Calcite) as the query engine. We override its execution plans so
that it fetches data directly from our cubes (or inverted index). It is
not a good idea to introduce another SQL engine, especially when a query
wants both data in the historical cube and streaming data. It would be
preferable if the data in Spark/Parquet could expose interfaces similar
to those of the cube storage, so that the current SQL query engine can
consume it.
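
To make the "similar interfaces" point concrete, here is a minimal sketch in
Python. It is only an illustration under my own assumptions, not how Kylin or
Calcite is actually wired: `StorageSource`, `InMemorySource` and `scan_all`
are hypothetical names, and the in-memory rows stand in for a cube segment
and for Parquet files written by the streaming job.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


class StorageSource(ABC):
    """Hypothetical common interface that the single query engine scans."""

    @abstractmethod
    def scan(self, columns: List[str],
             predicate: Callable[[Row], bool]) -> Iterator[Row]:
        """Yield matching rows, projected onto the requested columns."""


class InMemorySource(StorageSource):
    """Stand-in for either a cube segment or a set of Parquet files."""

    def __init__(self, rows: Iterable[Row]):
        self._rows = list(rows)

    def scan(self, columns, predicate):
        for row in self._rows:
            if predicate(row):
                yield {c: row[c] for c in columns}


def scan_all(sources, columns, predicate):
    """Union the scans, so one engine sees historical and streaming data."""
    for source in sources:
        yield from source.scan(columns, predicate)


# Assumed sample data: one "historical cube" and one "streaming" source.
historical_cube = InMemorySource([
    {"day": "2015-09-23", "item": "a", "price": 10},
    {"day": "2015-09-23", "item": "b", "price": 3},
])
streaming_parquet = InMemorySource([
    {"day": "2015-09-24", "item": "a", "price": 7},
])

rows = list(scan_all([historical_cube, streaming_parquet],
                     ["day", "price"],
                     lambda r: r["price"] > 5))
```

The point of the sketch is that the query engine only ever talks to `scan`,
so whether a source is backed by a cube or by Parquet stays hidden from it.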

On Thu, Sep 24, 2015 at 3:07 PM, Gaspare Maria <
[email protected]> wrote:

>
>
> Hi,
> It depends on how you organize the data. For example, where do you store
> the facts?
> If you store facts in HBase and build indexes on low-selectivity columns
> (e.g. month or day), then you will have too many gets on HBase.
> Why not use Spark Streaming and Spark DataFrames? You can save the
> latest data received via Spark Streaming as Parquet and then use a cached
> RDD to query it with SQL. This is very fast; you can join with dimension
> tables (static or created from the received data), and you can also offer
> a SQL interface via Thrift.
> Then, according to the cube refresh policies, update the cube from the
> Parquet files and remove the latest files as soon as the cube is updated.
> We should try to write a design document covering all the proposed
> solutions, listing the pros and cons of each.
> Even though I am busy with customer projects, I can contribute to writing
> such a document if you want, but someone should start by writing up the
> solutions already implemented.
> Regards,
> -- gas
>
>
>
> -------- Original message --------
> From: Sarnath <[email protected]>
> Date: 24/09/2015  08:22  (GMT+01:00)
> To: [email protected]
> Subject: Re: Re: Kylin Real time
>
> Hi,
>
> Can you share some reasons why "Inverted Index" did not work?
> I ask because I am trying to do precisely the same thing for storing
> cubes in our own private implementation.
> I am wondering what problems lie ahead.
>
> Thanks,
> Best,
> Sarnath
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
