Druid quick comparison

Xiaoxiang Yu Mon, 04 Dec 2023 18:03:22 -0800

The default branch is for 4.X, which is a maintenance branch; the active
branch is kylin5.
I will change the default branch to kylin5 later.


------------------------
With warm regard
Xiaoxiang Yu



On Tue, Dec 5, 2023 at 9:12 AM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:

> Hi Xiaoxiang, Sirs / Madams
>
> Can you see the attached photo?
>
> My boss asked why Druid commits code regularly while Kylin has had no
> commits since July.
>
>
> On Mon, 4 Dec 2023 at 15:33 Xiaoxiang Yu <x...@apache.org> wrote:
>
>> I think so.
>>
>> Response time is not the only factor in making a decision. Kylin could be
>> cheaper when the query pattern is suitable for the Kylin model, and Kylin
>> can guarantee reasonable query latency. ClickHouse will be quicker in an
>> ad hoc query scenario.
>>
>> By the way, Youzan and Kyligence combine the two to provide
>> unified data analytics services for their customers.
>>
>> ------------------------
>> With warm regard
>> Xiaoxiang Yu
>>
>>
>>
>> On Mon, Dec 4, 2023 at 4:01 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>
>>> Hi Xiaoxiang, thank you.
>>>
>>> If my client uses a cloud computing service like GCP or AWS, which will
>>> cost more: the precalculation feature of Kylin, or ClickHouse? (In the
>>> case of Kylin, my thought is that the query execution is done once and
>>> stored in the cube to be used many times, so Kylin uses less cloud
>>> computation. Is that true?)
>>>
>>> On Mon, Dec 4, 2023 at 2:46 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>>
>>> > The following text is part of an article
>>> > (https://zhuanlan.zhihu.com/p/343394287).
>>> >
>>> >
>>> > ===============================================================================
>>> >
>>> > Kylin is suitable for aggregation queries with fixed patterns because of
>>> > its precomputation technology, for example when the join, group by, and
>>> > where-condition patterns in SQL are relatively fixed. The larger the data
>>> > volume is, the more obvious the advantages of using Kylin are. Kylin's
>>> > advantages are especially large for deduplication (count distinct),
>>> > Top N, and Percentile, and it is used in a large number of scenarios,
>>> > such as dashboards, all kinds of reports, large-screen displays, traffic
>>> > statistics, and user behavior analysis. Meituan, Aurora, Shell Housing,
>>> > etc. use Kylin to build their data service platforms, serving millions
>>> > to tens of millions of queries per day, and most of the queries complete
>>> > within 2 - 3 seconds. There is no better alternative for such a
>>> > high-concurrency scenario.
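
To make the precomputation idea above concrete, here is a minimal sketch in
plain Python. It is not Kylin code: the rows, the dimension names, and the
functions `build_cube` and `query` are all hypothetical, and the exact
`set`-based count distinct is a simplification chosen for readability. The
sketch only shows why queries with a fixed group-by pattern can be answered
by lookup instead of by scanning the raw rows.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, user_id, amount)
ROWS = [
    ("north", "book", "u1", 10.0),
    ("north", "book", "u2", 5.0),
    ("north", "pen", "u1", 2.0),
    ("south", "book", "u3", 7.0),
    ("south", "pen", "u3", 1.0),
]
DIMENSIONS = ("region", "product")


def build_cube(rows):
    """Precompute SUM(amount) and COUNT(DISTINCT user_id) for every
    combination of dimensions (one aggregate table per combination)."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), size):
            agg = defaultdict(lambda: {"sum": 0.0, "users": set()})
            for row in rows:
                key = tuple(row[i] for i in dims)
                agg[key]["sum"] += row[3]
                agg[key]["users"].add(row[2])
            cube[tuple(DIMENSIONS[i] for i in dims)] = dict(agg)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY query from the precomputed aggregates only.
    `group_by` must list dimensions in the same order as DIMENSIONS."""
    table = cube[tuple(group_by)]
    return {key: (v["sum"], len(v["users"])) for key, v in table.items()}
```

With these rows, `query(build_cube(ROWS), ["region"])` returns
`{("north",): (17.0, 2), ("south",): (8.0, 1)}`. The build step touches every
row once per dimension combination, while each later query is a dictionary
lookup, which is the trade-off the article describes.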
>>> >
>>> > ClickHouse, because of its MPP architecture, has high computing power and
>>> > is more suitable when query requests are more flexible, or when there is
>>> > a need for detail-level queries with low concurrency. Scenarios include
>>> > user label filtering over very many columns with arbitrarily combined
>>> > where conditions, and complex ad hoc queries with low concurrency. If the
>>> > amount of data and access is large, you need to deploy a distributed
>>> > ClickHouse cluster, which is a greater challenge for operation and
>>> > maintenance.
>>> >
>>> > If some queries are very flexible but infrequent, it is more
>>> > resource-efficient to compute them on the fly. Since the number of
>>> > queries is small, even if each query consumes a lot of computational
>>> > resources, it is still cost-effective overall. If some queries have a
>>> > fixed pattern and the query volume is large, Kylin is more suitable:
>>> > large computational resources are spent up front to save the results,
>>> > and that upfront cost is amortized over every query, so it is the most
>>> > economical option.
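
The amortization argument above can be written as a small break-even
calculation. This is only a sketch under assumed numbers: the function names
are hypothetical and the costs are abstract "cost units", not measurements of
Kylin or ClickHouse.

```python
import math


def total_cost_on_the_fly(num_queries, cost_per_query):
    """Every query pays the full computation cost."""
    return num_queries * cost_per_query


def total_cost_precomputed(num_queries, build_cost, cost_per_lookup):
    """One upfront build, then a cheap lookup per query."""
    return build_cost + num_queries * cost_per_lookup


def break_even_queries(build_cost, cost_per_query, cost_per_lookup):
    """Smallest query count at which precomputation is strictly cheaper.
    Returns None if a lookup is not cheaper than computing on the fly."""
    saving = cost_per_query - cost_per_lookup
    if saving <= 0:
        return None
    return math.floor(build_cost / saving) + 1
```

For example, with an assumed build cost of 1000 units, 5 units per on-the-fly
query, and 1 unit per lookup, `break_even_queries(1000, 5, 1)` returns 251:
below that volume computing on the fly is no more expensive, above it the
precomputed results win.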
>>> >
>>> > --- Translated with DeepL.com (free version)
>>> >
>>> >
>>> > ------------------------
>>> > With warm regard
>>> > Xiaoxiang Yu
>>> >
>>> >
>>> >
>>> > On Mon, Dec 4, 2023 at 3:16 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>> >
>>> >> Thank you Xiaoxiang for the near real-time streaming feature. That's
>>> >> great.
>>> >>
>>> >> This morning there has been a new challenge to my team: ClickHouse
>>> >> offered us the speed of calculating 8 billion rows in milliseconds,
>>> >> which is faster than my demonstration (I used Kylin to calculate
>>> >> 1 billion rows in 2.9 seconds).
>>> >>
>>> >> Can you briefly suggest the advantages of Kylin over ClickHouse so that
>>> >> I can defend my demonstration?
>>> >>
>>> >> On Mon, Dec 4, 2023 at 1:55 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>> >>
>>> >> > 1. "In this important scenario of realtime analytics, the reason here
>>> >> > is that kylin has lag time due to model update of new segment build,
>>> >> > is that correct?"
>>> >> >
>>> >> > You are correct.
>>> >> >
>>> >> > 2. "If that is true, then can you suggest a work-around of combination
>>> >> > of ... "
>>> >> >
>>> >> > Kylin is planning to introduce NRT streaming (coding is completed but
>>> >> > not yet released), which can reduce the time lag to about 3 minutes
>>> >> > (that is my estimate, but I am quite certain about it).
>>> >> > NRT stands for 'near real-time'; it runs a job that periodically does
>>> >> > micro-batch aggregation and persistence. The price is that you need
>>> >> > to run and monitor a long-running job. This feature is based on Spark
>>> >> > Streaming, so you need some knowledge of it.
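
The micro-batch idea described above can be illustrated with a few lines of
plain Python. This is not Spark Streaming and not Kylin's NRT implementation,
which has not been released; the event layout, the function names, and the
180-second interval (taken from the "about 3 minutes" estimate) are
assumptions for illustration only.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds):
    """Split (timestamp_seconds, key, value) events into fixed-length
    batches, ordered by time window."""
    batches = defaultdict(list)
    for ts, key, value in events:
        batches[ts // interval_seconds].append((key, value))
    return [batches[window] for window in sorted(batches)]


def run(events, interval_seconds):
    """Aggregate each batch on its own, then merge the partial result into
    a store that stands in for persisted aggregates."""
    store = defaultdict(float)
    for batch in micro_batches(events, interval_seconds):
        partial = defaultdict(float)
        for key, value in batch:
            partial[key] += value
        for key, total in partial.items():
            store[key] += total  # the "persistence" step of this sketch
    return dict(store)
```

With `events = [(0, "a", 1.0), (30, "b", 2.0), (190, "a", 3.0), (200, "a", 4.0)]`
and an interval of 180 seconds, `run(events, 180)` returns
`{"a": 8.0, "b": 2.0}`. In a design like this an event only becomes queryable
after its batch closes and is merged, so the lag is roughly one interval plus
processing time.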
>>> >> >
>>> >> > I am curious: what is the maximum time lag your customers can
>>> >> > tolerate?
>>> >> > Personally, I guess a minute-level time lag is OK for most cases.
>>> >> >
>>> >> > ------------------------
>>> >> > With warm regard
>>> >> > Xiaoxiang Yu
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, Dec 4, 2023 at 12:28 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>> >> >
>>> >> > > Druid is better in
>>> >> > > - Have a real-time datasource like Kafka etc.
>>> >> > >
>>> >> > > ==========================
>>> >> > >
>>> >> > > Hi Xiaoxiang, thank you for your response.
>>> >> > >
>>> >> > > In this important scenario of realtime analytics, the reason here
>>> >> > > is that kylin has lag time due to model update of new segment
>>> >> > > build, is that correct?
>>> >> > >
>>> >> > > If that is true, then can you suggest a work-around of combination
>>> >> > > of:
>>> >> > >
>>> >> > > (time - lag kylin cube) + (realtime DB update) to provide
>>> >> > > realtime capability?
>>> >> > >
>>> >> > > IMO, the point here is to find that (realtime DB update) and
>>> >> > > integrate it with (time - lag kylin cube).
>>> >> > >
>>> >> > > On Fri, Dec 1, 2023 at 1:53 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>> >> > >
>>> >> > > > I researched and tested Druid two years ago (I don't know much
>>> >> > > > about how Druid has changed in these two years. New features
>>> >> > > > that I know of are: new UI, fully on K8s, etc.).
>>> >> > > >
>>> >> > > > Here are some cases where you should consider using Druid rather
>>> >> > > > than Kylin at the moment (comparing Kylin 5.0-beta with the Druid
>>> >> > > > I used two years ago):
>>> >> > > >
>>> >> > > > - Have a real-time datasource like Kafka etc.
>>> >> > > > - Most queries are small (based on my test results, I think Druid
>>> >> > > >   had better response time for small queries two years ago).
>>> >> > > > - Don't know how to optimize Spark/Hadoop, and want to use the
>>> >> > > >   K8S/public cloud platform as your deployment platform.
>>> >> > > >
>>> >> > > > But I do think there are many scenarios in which Kylin could be
>>> >> > > > better, like:
>>> >> > > >
>>> >> > > > - Better performance for complex/big queries. Kylin can have a
>>> >> > > >   more exact-match/fine-grained index for queries containing
>>> >> > > >   different `Group By dimensions`.
>>> >> > > > - User-friendly UI for modeling.
>>> >> > > > - Support 'Join' better? (Not sure at the moment)
>>> >> > > > - ODBC driver for different BI tools (Druid's website did not
>>> >> > > >   show that it supports ODBC well).
>>> >> > > > - Looks like Kylin supports ANSI SQL better than Druid.
>>> >> > > >
>>> >> > > >
>>> >> > > > I don't know Pinot, so I have nothing to say about it.
>>> >> > > > I hope this helps; feel free to share your opinion.
>>> >> > > >
>>> >> > > > ------------------------
>>> >> > > > With warm regard
>>> >> > > > Xiaoxiang Yu
>>> >> > > >
>>> >> > > >
>>> >> > > >
>>> >> > > > On Fri, Dec 1, 2023 at 11:11 AM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>> >> > > >
>>> >> > > >> Dear Xiaoxiang,
>>> >> > > >> Sirs/Madams,
>>> >> > > >>
>>> >> > > >> May I post my boss's question:
>>> >> > > >>
>>> >> > > >> What are the pros and cons of the OLAP platform Kylin compared to
>>> >> > > >> Pinot and Druid?
>>> >> > > >>
>>> >> > > >> Please kindly let me know
>>> >> > > >>
>>> >> > > >> Thank you very much and best regards
>>> >> > > >>
>>> >> > > >
>>> >> > >
>>> >> >
>>> >>
>>> >
>>>
>>

Re: Pinot/Kylin/Druid quick comparison
