Thank you Xiaoxiang, please update me once you have changed the default branch. Since people are judging by those commit numbers, I hope this change will turn the impression around.
On Tue, Dec 5, 2023 at 9:02 AM Xiaoxiang Yu <x...@apache.org> wrote:

> The default branch is for 4.X, which is a maintained branch; the active
> branch is kylin5.
> I will change the default branch to kylin5 later.
>
> ------------------------
> With warm regard
> Xiaoxiang Yu
>
>
> On Tue, Dec 5, 2023 at 9:12 AM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>
>> Hi Xiaoxiang, Sirs / Madams,
>>
>> Can you see the attached photo?
>>
>> My boss asked why Druid gets commits regularly while Kylin has not had a
>> commit since July.
>>
>> On Mon, 4 Dec 2023 at 15:33 Xiaoxiang Yu <x...@apache.org> wrote:
>>
>>> I think so.
>>>
>>> Response time is not the only factor in the decision. Kylin could be
>>> cheaper when the query pattern suits the Kylin model, and Kylin can
>>> guarantee reasonable query latency. ClickHouse will be quicker in an
>>> ad hoc query scenario.
>>>
>>> By the way, Youzan and Kyligence combine them to provide unified data
>>> analytics services for their customers.
>>>
>>> ------------------------
>>> With warm regard
>>> Xiaoxiang Yu
>>>
>>>
>>> On Mon, Dec 4, 2023 at 4:01 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>>
>>>> Hi Xiaoxiang, thank you.
>>>>
>>>> In case my client uses a cloud computing service like GCP or AWS, which
>>>> will cost more: Kylin with its precalculation feature, or ClickHouse?
>>>> (In Kylin's case, my understanding is that the query computation is done
>>>> once and stored in the cube to be reused many times, so Kylin uses less
>>>> cloud computation. Is that true?)
>>>>
>>>> On Mon, Dec 4, 2023 at 2:46 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>>>
>>>>> The following text is part of an article
>>>>> (https://zhuanlan.zhihu.com/p/343394287).
>>>>>
>>>>> ===============================================================================
>>>>>
>>>>> Kylin is suitable for aggregation queries with fixed patterns because
>>>>> of its pre-calculation technology, for example when the join, group by,
>>>>> and where conditions in the SQL are relatively fixed. The larger the
>>>>> data volume, the more obvious Kylin's advantage; Kylin is especially
>>>>> strong in deduplication (count distinct), Top N, and percentile
>>>>> scenarios. It is used in a large number of scenarios such as
>>>>> dashboards, all kinds of reports, large-screen displays, traffic
>>>>> statistics, and user behavior analysis. Meituan, Aurora, and Shell
>>>>> Housing use Kylin to build their data service platforms, serving
>>>>> millions to tens of millions of queries per day, and most queries
>>>>> complete within 2 - 3 seconds. There is no better alternative for such
>>>>> high-concurrency scenarios.
>>>>>
>>>>> ClickHouse, because of its MPP architecture, has high computing power
>>>>> and is more suitable when queries are more flexible, or when
>>>>> detail-level queries with low concurrency are needed. Scenarios include
>>>>> user-label filtering over very many columns with arbitrarily combined
>>>>> where conditions, and complex ad hoc queries without heavy concurrency.
>>>>> If the amount of data and access is large, you need to deploy a
>>>>> distributed ClickHouse cluster, which is a bigger challenge for
>>>>> operations and maintenance.
>>>>>
>>>>> If some queries are very flexible but infrequent, it is more
>>>>> resource-efficient to compute them on the fly: since the number of
>>>>> queries is small, even if each query consumes a lot of computational
>>>>> resources, it is still cost-effective overall. If some queries have a
>>>>> fixed pattern and the query volume is large, Kylin is more suitable: by
>>>>> spending computational resources up front to save the results, the
>>>>> pre-calculation cost is amortized over every query, which is the most
>>>>> economical.
>>>>>
>>>>> --- Translated with DeepL.com (free version)
>>>>>
>>>>> ------------------------
>>>>> With warm regard
>>>>> Xiaoxiang Yu
>>>>>
>>>>>
>>>>> On Mon, Dec 4, 2023 at 3:16 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>>>>
>>>>>> Thank you Xiaoxiang for the near real-time streaming feature. That's
>>>>>> great.
>>>>>>
>>>>>> This morning there was a new challenge for my team: ClickHouse offered
>>>>>> us the speed of calculating 8 billion rows in milliseconds, which is
>>>>>> faster than my demonstration (I used Kylin to calculate 1 billion rows
>>>>>> in 2.9 seconds).
>>>>>>
>>>>>> Can you briefly suggest the advantages of Kylin over ClickHouse so
>>>>>> that I can defend my demonstration?
>>>>>>
>>>>>> On Mon, Dec 4, 2023 at 1:55 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>>>>>
>>>>>>> 1. "In this important scenario of real-time analytics, the reason
>>>>>>> here is that Kylin has lag time due to the model update of a new
>>>>>>> segment build, is that correct?"
>>>>>>>
>>>>>>> You are correct.
>>>>>>>
>>>>>>> 2. "If that is true, then can you suggest a work-around of a
>>>>>>> combination of ..."
>>>>>>>
>>>>>>> Kylin is planning to introduce NRT streaming (coding is completed but
>>>>>>> not released), which can bring the time lag down to about 3 minutes
>>>>>>> (that is my estimation, but I am quite certain about it).
>>>>>>> NRT stands for 'near real-time': it runs a job that does micro-batch
>>>>>>> aggregation and persistence periodically. The price is that you need
>>>>>>> to run and monitor a long-running job. This feature is based on Spark
>>>>>>> Streaming, so you need knowledge of it.
>>>>>>>
>>>>>>> I am curious what maximum time lag your customers can tolerate?
>>>>>>> Personally, I guess a minute-level time lag is OK for most cases.
>>>>>>>
>>>>>>> ------------------------
>>>>>>> With warm regard
>>>>>>> Xiaoxiang Yu
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 4, 2023 at 12:28 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>>>>>>
>>>>>>>> Druid is better in
>>>>>>>> - Have a real-time datasource like Kafka etc.
>>>>>>>>
>>>>>>>> ==========================
>>>>>>>>
>>>>>>>> Hi Xiaoxiang, thank you for your response.
>>>>>>>>
>>>>>>>> In this important scenario of real-time analytics, the reason here
>>>>>>>> is that Kylin has lag time due to the model update of a new segment
>>>>>>>> build, is that correct?
>>>>>>>>
>>>>>>>> If that is true, then can you suggest a work-around combining:
>>>>>>>>
>>>>>>>> (time-lagged Kylin cube) + (real-time DB update) to provide
>>>>>>>> real-time capability?
>>>>>>>>
>>>>>>>> IMO, the point here is to find that (real-time DB update) and
>>>>>>>> integrate it with the (time-lagged Kylin cube).
>>>>>>>>
>>>>>>>> On Fri, Dec 1, 2023 at 1:53 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> I researched and tested Druid two years ago (I don't know much
>>>>>>>>> about how Druid has changed in these two years; new features that I
>>>>>>>>> know of are the new UI, running fully on K8s, etc.).
>>>>>>>>>
>>>>>>>>> Here are some cases where you should consider using Druid rather
>>>>>>>>> than Kylin at the moment (comparing Kylin 5.0-beta with the Druid I
>>>>>>>>> used two years ago):
>>>>>>>>>
>>>>>>>>> - You have a real-time datasource like Kafka etc.
>>>>>>>>> - Most queries are small (based on my test results, Druid had
>>>>>>>>> better response time for small queries two years ago).
>>>>>>>>> - You don't know how to optimize Spark/Hadoop and want to use K8s
>>>>>>>>> or a public cloud platform as your deployment platform.
>>>>>>>>>
>>>>>>>>> But I do think there are many scenarios in which Kylin could be
>>>>>>>>> better, like:
>>>>>>>>>
>>>>>>>>> - Better performance for complex/big queries. Kylin can have a more
>>>>>>>>> exact-match/fine-grained index for queries containing different
>>>>>>>>> `Group By dimensions`.
>>>>>>>>> - User-friendly UI for modeling.
>>>>>>>>> - Better support for 'Join'? (Not sure at the moment.)
>>>>>>>>> - ODBC drivers for different BI tools (Druid's website did not show
>>>>>>>>> that it supports ODBC well).
>>>>>>>>> - It looks like Kylin supports ANSI SQL better than Druid.
>>>>>>>>>
>>>>>>>>> I don't know Pinot, so I have nothing to say about it.
>>>>>>>>> Hope this helps, and you are free to share your opinion.
>>>>>>>>>
>>>>>>>>> ------------------------
>>>>>>>>> With warm regard
>>>>>>>>> Xiaoxiang Yu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Dec 1, 2023 at 11:11 AM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Dear Xiaoxiang,
>>>>>>>>>> Sirs/Madams,
>>>>>>>>>>
>>>>>>>>>> May I post my boss's question:
>>>>>>>>>>
>>>>>>>>>> What are the pros and cons of the OLAP platform Kylin compared to
>>>>>>>>>> Pinot and Druid?
>>>>>>>>>>
>>>>>>>>>> Please kindly let me know.
>>>>>>>>>>
>>>>>>>>>> Thank you very much and best regards
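P.S. For anyone following the NRT streaming discussion above (a periodic job doing micro-batch aggregation and persistence on top of Spark Streaming), below is a minimal sketch of that micro-batch pattern, written here with PySpark Structured Streaming. This is only an illustration of the general idea, not Kylin's actual implementation; the Kafka broker, topic, column names, output paths, and trigger interval are all placeholder assumptions.

# A rough sketch of periodic micro-batch aggregation with Spark Structured
# Streaming -- the kind of pattern NRT streaming relies on. Not Kylin code;
# broker, topic, columns, and paths below are hypothetical placeholders.
# Requires the spark-sql-kafka connector on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nrt-microbatch-sketch").getOrCreate()

# Read a real-time datasource (Kafka) as a stream.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder broker
    .option("subscribe", "txn_events")                # placeholder topic
    .load()
)

# Parse the message payload: keep an amount field plus the event timestamp.
parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.amount")
     .cast("double").alias("amount"),
    F.col("timestamp"),
)

# Aggregate per 1-minute window; events arriving more than 5 minutes late
# are dropped by the watermark.
agg = (
    parsed
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .agg(F.count("*").alias("txn_count"), F.sum("amount").alias("txn_amount"))
)

# Persist each micro-batch periodically; the minute-level time lag comes from
# this periodic trigger plus the time to build and merge the results.
query = (
    agg.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "/tmp/nrt_agg")                   # placeholder output
    .option("checkpointLocation", "/tmp/nrt_agg_ckpt")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()

The trade-off discussed in this thread shows up directly here: the aggregation is computed and persisted once per trigger interval and then reused by every query, but the price is a long-running job that has to be operated and monitored.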