R: Re: 回复： Kylin Real time

Gaspare Maria Thu, 24 Sep 2015 14:25:48 -0700

    
Hi,
Point 2 is perfect. Updating cubes in near real time (e.g. every 5 minutes) is
acceptable.
Doing it via Spark Streaming will also be fast, provided the rows are properly
split across multiple regions.
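
A minimal sketch of what "properly split" could mean in practice, assuming the common HBase technique of salting row keys so that parallel PUTs land on different regions. The bucket count, the key layout, and the function name below are hypothetical and are not taken from Kylin:

```python
import hashlib

# Assumed number of pre-split regions; a real table would be pre-split on
# the same salt prefixes ("00", "01", ...).
NUM_REGIONS = 8


def salted_row_key(node_id, interval_start):
    """Build a row key whose leading salt spreads writes across regions.

    The salt is derived from a hash of the node id, so the same node always
    maps to the same bucket and can still be located at read time.
    """
    digest = hashlib.md5(node_id.encode("utf-8")).digest()
    bucket = digest[0] % NUM_REGIONS
    return f"{bucket:02d}|{node_id}|{interval_start}"
```

With keys like these, writers for different nodes tend to hit different regions instead of queuing on a single hot region. The trade-off is that a scan over all nodes has to fan out over every salt prefix.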



Gaspare Maria

-------- Original message --------
From: hongbin ma <[email protected]>
Date: 24/09/2015  05:40  (GMT+01:00)
To: dev <[email protected]>
Subject: Re: 回复： Kylin Real time

Hi luke

I'm afraid your answer might be a little confusing to outside customers.
Cube, Streaming, and Inverted Index are not concepts in the same context.
My understanding is:

1. "Cube" and "Inverted Index" are the two options we have for storing
digested data. This is what we let the modeler specify in the data model.
Cube is Kylin's original choice for storage, and later we introduced
"Inverted Index" in an attempt to serve near-real-time requirements (because
with the v1 engine the cube-building process is very time consuming, whereas
putting digested data into an inverted index is much faster). However,
development on Inverted Index is paused for several reasons.

2. "Streaming" is a concept contrasted with "Batch". Before the 2.x versions,
Kylin used the v1 engine to build cubes, which only supports loading data from
Hive tables in a batch fashion; this is why it is called "Batch" mode. In the
2.x versions, we introduced the new v2 engine and started to support building
cubes from streaming queues like Kafka. As previously explained, the current
streaming solution is not strictly "real time streaming" because it is
basically consuming the streaming data to build mini cubes.
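
As a rough illustration of the mini-cube idea in point 2, the following Python sketch groups events into fixed windows and pre-aggregates a SUM for every combination of dimensions. It is not Kylin code: the window size, the event fields (`ts`, `node`, `cell`, `upLink`), and the function names are assumptions made only for this example, and a real pipeline would consume from Kafka and write to HBase instead of using in-memory structures.

```python
from collections import defaultdict
from itertools import combinations

# Assumed mini-batch length: five minutes, expressed in seconds.
WINDOW_SECONDS = 300


def window_start(ts):
    """Return the start timestamp of the window that contains ts."""
    return ts - ts % WINDOW_SECONDS


def build_mini_cubes(events, dimensions, measure):
    """Build one small cube per time window.

    For each window, SUM(measure) is pre-aggregated for every subset of
    the given dimensions, including the empty subset (the grand total).
    Returns {window_start: {((dim, value), ...): sum}}.
    """
    cubes = defaultdict(lambda: defaultdict(float))
    for event in events:
        cube = cubes[window_start(event["ts"])]
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, event[d]) for d in dims)
                cube[key] += event[measure]
    return cubes
```

Each window's cube is independent of the others, which is what makes the approach "mini batch": a query over a longer period would combine the per-window results rather than rebuild one large cube.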

On Wed, Sep 23, 2015 at 9:39 PM, Luke Han <[email protected]> wrote:

> Hi gaspare,
>     You have raised a great discussion about those things.
>     In the original idea there was only the cube, but we came up with a new
> concept, the Data Model, since "Cube" itself is just one kind of storage.
>
>     There's one option for the modeler to define/pick which kind of storage
> to use for the Data Model; we actually call it
> the Realization interface, for Cube, Streaming and Inverted Index,
> and it is extensible to any others in the future.
>
>    So you are right, there will be a UI setting in the Data Model for
> this, which will come later since 2.x is under heavy refactoring and
> tuning, just as Hongbin mentioned.
>
>     Please stay tuned for the latest update of streaming/realtime
> capability of Kylin.
>
>     Thanks.
>
> Luke
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Wed, Sep 23, 2015 at 2:55 PM, hongbin ma <[email protected]> wrote:
>
> > hi gaspare
> >
> > Actually we do have a similar solution in the 2.x-staging code base. It is
> > called "Streaming Cubing" (in contrast to Inverted Index, it uses a
> > mini-batch cubing solution to tackle the near-real-time problem).
> >
> > There will be daemon threads that start up periodically to consume a
> > data batch (maybe a five-minute batch) from Kafka and build a
> > mini-cube in memory before writing it into HBase. We have not officially
> > announced the functionality because:
> >
> > 1. Currently we do not have a front-end UI for the configuration,
> > including specifying Kafka settings, etc. This makes Streaming
> > Cubing difficult to use now. The good news is that we're actively working
> > on it (https://issues.apache.org/jira/browse/KYLIN-1041).
> > 2. Lack of documentation.
> > 3. Currently we have not leveraged Spark Streaming (or other alternatives)
> > to process the data batch. Our daemon thread is a simple Java thread, and it
> > will be problematic when the data batch grows too large. We intend to
> > migrate to horizontally scalable solutions like Spark Streaming, but haven't
> > had enough bandwidth to start
> > (https://issues.apache.org/jira/browse/KYLIN-1042).
> >
> > Anyway, customers should be able to use Streaming Cubing when we officially
> > announce the 2.x versions.
> >
> >
> >
> >
> >
> > On Wed, Sep 23, 2015 at 6:00 AM, Gaspare Maria <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > one more question/feedback regarding Kylin Real time.
> > >
> > > > There are many use cases (in particular in the TELCO environment) where
> > > > streams of data arrive at regular intervals (usually every 5 or 15 minutes)
> > > > and "real-time" aggregations can always be done per interval (for
> > > > example, SUM(upLink) per node in the last interval). In such use cases
> > > > "maybe" the CUBE could be updated in near real time after
> > > > pre-aggregation with Spark Streaming (of course without creating the HFiles,
> > > > but using parallel PUTs on HBase with Spark). In our experience,
> > > > for "simple" CUBEs this should be faster than Inverted Indexes.
> > > >
> > > > Of course there are use cases where this approach is not applicable; in
> > > > those cases Inverted Indexes are still valid.
> > > >
> > > > It would be good if Kylin could give the "CUBE Administrator"
> > > > the possibility to choose how to do "Real-time CUBE Update". For example,
> > > > give the option to choose either "Inverted Indexes" or "HBase".
> > > >
> > > > Do you think such an approach could be applicable to Kylin?
> > >
> > > Regards,
> > >
> > > -- gas
> > >
> > >
> > >
> > > On 09/21/2015 11:36 AM, Li Yang wrote:
> > >
> > > >> Gas is mostly right, with one addition: a query can hit both the
> > > >> inverted index and the cube if it asks for both latest and historic data.
> > > >> The results from the two sources will get aggregated at query time.
> > >>
> > >> On Fri, Sep 18, 2015 at 11:26 PM, Gaspare Maria <
> > >> [email protected]> wrote:
> > >>
> > >> Hi,
> > >>>
> > > >>> so if I understood, the idea behind Kylin Real Time is:
> > > >>>
> > > >>>   * Inverted Indexes (maybe Lucene or inverted indexes on HBase) will
> > > >>>     be built according to the CUBE schema in near real time by using Spark
> > > >>>     (Streaming) Kafka consumers;
> > > >>>   * At query time, if the query touches the latest data it will be routed
> > > >>>     to the Inverted Indexes, otherwise to the CUBE on HBase;
> > > >>>   * Queries that touch the latest data should be limited, due to the
> > > >>>     limitations of inverted indexes;
> > > >>>   * Queries over a long period of time back (e.g. from now back to 2 months
> > > >>>     ago) will be routed partly to HBase and partly to the Inverted Indexes.
> > >>>
> > >>>
> > >>> Am I right?
> > >>>
> > >>> Regards,
> > >>>
> > >>> -- gas
> > >>>
> > >>>
> > >>>
> > >>> On 09/18/2015 12:35 AM, Henry Saputra wrote:
> > >>>
> > >>> Awesome, thanks Luke
> > >>>>
> > >>>> On Thu, Sep 17, 2015 at 2:37 AM, Luke Han <[email protected]>
> wrote:
> > >>>>
> > >>>> Here's JIRA: https://issues.apache.org/jira/browse/KYLIN-599
> > >>>>>
> > >>>>>
> > >>>>> Best Regards!
> > >>>>> ---------------------
> > >>>>>
> > >>>>> Luke Han
> > >>>>>
> > >>>>> On Thu, Sep 17, 2015 at 1:09 AM, Henry Saputra <
> > >>>>> [email protected]>
> > >>>>> wrote:
> > >>>>>
> > > >>>>>> That is good to know. Li Yang, Luke, could one of you share the design
> > > >>>>>> document for this realtime OLAP query in the JIRA?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>>
> > >>>>>> - Henry
> > >>>>>>
> > >>>>>> On Tue, Sep 15, 2015 at 11:12 PM, Li Yang <[email protected]>
> > wrote:
> > >>>>>>
> > > >>>>>>>> There will be incremental updates on the existing cubes, but during
> > > >>>>>>>> that updates I suppose no queries will be ran against them?
> > > >>>>>>>
> > > >>>>>>> Yes, it's mini batch, usually at intervals of minutes. And of course the
> > > >>>>>>> cube CAN serve queries while the mini incremental build is under way. How
> > > >>>>>>> could we take the cube offline every few minutes? That's impossible.  :-)
> > >>>>>>>
> > >>>>>>> On Tue, Sep 15, 2015 at 6:41 PM, Sarnath <[email protected]>
> > wrote:
> > >>>>>>>
> > > >>>>>>>> Inverted index? That sounds interesting. We use inverted index to serve
> > > >>>>>>>> the cubes in our internal implementation.
> > >>>>>>>
> > > >>>>>>>> I come from the Big Data Center of Excellence of an Indian IT major.
> > > >>>>>>>>
> > > >>>>>>>> We have been experimenting with the idea of serving cubes through the
> > > >>>>>>>> ElasticSearch REST API. This is not related to Kylin. This is our own
> > > >>>>>>>> internal development.
> > > >>>>>>>>
> > > >>>>>>>> The motivation for this is: once the cube is built, it needs to be
> > > >>>>>>>> served.
> > >>>>>>>>
> > >>>>>>>> The query looks somewhat like this:
> > >>>>>>>>
> > >>>>>>>> "Given ProductID=*, Year=2015, Fetch All Quantities Sold"
> > >>>>>>>>
> > >>>>>>>> "Given ProductID=XX, Fetch how much it has sold every Month"
> > >>>>>>>>
> > >>>>>>>> Find all entries that match K1=V1, K2=V2
> > >>>>>>>>
> > > >>>>>>>> This relieves us of a lot of things (storage, REST API, etc.) and makes
> > > >>>>>>>> the cubes easily searchable.
> > >>>>>>>
> > > >>>>>>>> However, we don't do SQL/MDX on top of it.  Tableau 9.1 Beta is
> > > >>>>>>>> experimenting with Web-Data-Connector, which we believe can be used for
> > > >>>>>>>> visualization... Apart from that, we experimented with a few
> > > >>>>>>>> auto-generated Kibana dashboards, which were just okay. But Kibana was
> > > >>>>>>>> not designed for cubes, so it has its own limitations.
> > >>>>>>>
> > >>>>>>>> Appreciate any feedback!
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>>
> > >>>>>>>> Sarnath
> > > >>>>>>>>
> > > >>>>>>>> I also think that it's mini-batch cubing.   It's time to bring the
> > > >>>>>>>> inverted index back into the roadmap. The inverted index will be the true
> > > >>>>>>>> real-time solution and can provide low-level query capability on the raw
> > > >>>>>>>> data.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Thanks!
> > >>>>>>>> JiangXu
> > >>>>>>>>
> > >>>>>>>>
> > > >>>>>>>> ------------------ Original message ------------------
> > > >>>>>>>> From: "Henry Saputra";<[email protected]>;
> > > >>>>>>>> Sent: Tuesday, September 15, 2015, 12:39 PM
> > > >>>>>>>> To: "[email protected]"<[email protected]>;
> > > >>>>>>>> Subject: Re: Kylin Real time
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Ok, but that still seems like mini batch to me.
> > >>>>>>>>
> > > >>>>>>>> There will be incremental updates on the existing cubes, but during
> > > >>>>>>>> those updates I suppose no queries will be run against them?
> > >>>>>>>>
> > >>>>>>>> - Henry
> > >>>>>>>>
> > >>>>>>>> On Mon, Sep 14, 2015 at 12:33 AM, Li Yang <[email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > > >>>>>>>>> Streaming OLAP provides near-real-time analysis where the data delay can
> > > >>>>>>>>> be as short as a few minutes.
> > > >>>>>>>>>
> > > >>>>>>>>> A traditional daily build allows the user to analyze yesterday's data. If
> > > >>>>>>>>> we increase the frequency to hourly, then the user can analyze the last
> > > >>>>>>>>> hour's data. Further down the line, how about an incremental build every
> > > >>>>>>>>> 5 minutes from a streaming source? Then the user can analyze data from
> > > >>>>>>>>> 5 minutes ago. That's Streaming OLAP!
> > >>>>>>>>>
> > > >>>>>>>>> On Mon, Sep 14, 2015 at 12:43 AM, Henry Saputra <[email protected]>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hi Luke,
> > > >>>>>>>>>>
> > > >>>>>>>>>> Could you clarify again what streaming OLAP means here?
> > > >>>>>>>>>>
> > > >>>>>>>>>> By definition OLAP works with historical data.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Maybe I missed it, but were there any discussions or a proposed design
> > > >>>>>>>>>> for it?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Thanks,
> > > >>>>>>>>>>
> > > >>>>>>>>>> - Henry
> > >>>>>>>>>>
> > > >>>>>>>>>> On Monday, August 3, 2015, Luke Han <[email protected]> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>>> Hi Siddharth,
> > > >>>>>>>>>>>       Kylin's next major release (0.8.x) will support Streaming OLAP,
> > > >>>>>>>>>>> which will be coming in Q4 since it is still under development now, as
> > > >>>>>>>>>>> Hongbin mentioned above.
> > > >>>>>>>>>>>       Could you please drop me a mail about your case? I would like to
> > > >>>>>>>>>>> better understand your scenario so that we can manage the coming
> > > >>>>>>>>>>> features well.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>       Thanks.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best Regards!
> > >>>>>>>>>>> ---------------------
> > >>>>>>>>>>>
> > >>>>>>>>>>> Luke Han
> > >>>>>>>>>>>
> > > >>>>>>>>>>> On Wed, Jul 29, 2015 at 2:08 PM, hongbin ma <[email protected]> wrote:
> > >>>>>>>>>>>
> > > >>>>>>>>>>>> For the current 0.7 releases, you cannot.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Real-time data processing and querying will be added in the 0.8 release.
> > > >>>>>>>>>>>> It is still under development and testing. We have achieved good progress
> > > >>>>>>>>>>>> on it; please wait for announcements.
> > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Wed, Jul 29, 2015 at 2:02 PM, Siddharth Ubale <[email protected]> wrote:
> > >>>>>>>>>>>>
> > > >>>>>>>>>>>>> Hi,
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> I would like to ask whether Kylin can be used as a real-time querying
> > > >>>>>>>>>>>>> system. The process of building a cube makes it look like a batch
> > > >>>>>>>>>>>>> process, after which the queries have low latency. However, can we get
> > > >>>>>>>>>>>>> a real-time idea of what the OLAP system's state is at the time of the
> > > >>>>>>>>>>>>> query?
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Thanks,
> > > >>>>>>>>>>>>> Siddharth
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> *Bin Mahone | 马洪宾*
> > >>>>>>>>>>>> Apache Kylin: http://kylin.io
> > >>>>>>>>>>>> Github: https://github.com/binmahone
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
