Re: How to reflect last hour data into Hive and Kylin Insights query window

Xiaoxiang Yu Wed, 22 Nov 2023 02:06:25 -0800

These are good questions; I can share some articles with you.

1. How to build a metric repository by Kylin to share among data teams (DA,
DS, AI), is that the usage of measure in Kylin?


I think a metric repository (or metrics store) is exactly where Kylin can
help. For example, Beike (ke.com) built an indicator/metrics platform whose
backend is Kylin; in other words, they created a metrics store on top of
Kylin.

The architecture looks like this:
https://mmbiz.qpic.cn/mmbiz_png/9xAoGyC249Kd9icMaNT1Gs7AlDAZic7PScYNCOkSQF8PqbuSLicoxhdk4w3kJtC0bms4FzW6iby08bNiaVsUzUkBPmg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1


Here is a technical article about it, written in Chinese (I am sorry it
is not translated):
 https://mp.weixin.qq.com/s/hsGjuaYfEfParcgTimBLnw


2. How to use Kylin for the Customer segmentation of Marketing dept?

Here are some articles (sorry again that these are not translated):
https://kylin.apache.org/blog/2016/11/28/intersect-count/
https://zhuanlan.zhihu.com/p/100131550
https://cn.kyligence.io/blog/kylin-chinagreentown-user-portrait-2/
https://cn.kyligence.io/blog/apache-kylin-count-distinct-application-in-user-behavior-analysis/
https://www.infoq.cn/article/xZYe1DUopNA9CzLwau3O
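
For question 2, here is a minimal sketch of the kind of query the
intersect-count article (first link above) describes. The table user_tags,
the columns city, user_id and tag, and the tag values are all hypothetical;
as far as I know, user_id also needs a precise count distinct measure in
the model. The function only prints the SQL, so it can be pasted into the
Insights window:

```shell
#!/bin/sh
# segment_query: print a segmentation query sketch.
# Table name, column names and tag values are hypothetical placeholders.
segment_query() {
  cat <<'SQL'
select city,
       intersect_count(user_id, tag, array['high_value']) as high_value_users,
       intersect_count(user_id, tag, array['active_30d']) as active_users,
       intersect_count(user_id, tag, array['high_value', 'active_30d']) as high_value_and_active
from user_tags
where tag in ('high_value', 'active_30d')
group by city
SQL
}
```

The third column counts the users that carry both tags, which is the
overlap between the two segments.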

You can send your presentation material to me if you are willing to share.

------------------------
With warm regard
Xiaoxiang Yu



On Wed, Nov 22, 2023 at 5:36 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:

> Thank you Xiaoxiang. My presentation about Kylin to the management is at
> noon tomorrow, so I am putting this issue on hold to focus on the following
> ones. Can you please advise:
>
> 1. How to build a metric repository by Kylin to share among data teams (DA,
> DS, AI), is that the usage of measure in Kylin?
> 2. How to use Kylin for the Customer segmentation of Marketing dept?
>
>
> On Wed, Nov 22, 2023 at 2:10 PM Xiaoxiang Yu <x...@apache.org> wrote:
>
> > Before you try again, you can use spark-sql/spark-shell to check whether the
> > data was loaded into your table successfully (or whether your data was copied
> > to the right place). The following shows how to start spark-sql/spark-shell
> > in a container.
> >
> > export HADOOP_CONF_DIR=/opt/hadoop-3.2.1/etc/hadoop
> >
> > cd /home/kylin/apache-kylin-5.0.0-beta-bin/spark
> >
> > bin/spark-shell --executor-cores 1 --num-executors 1 --master yarn
> >
> >
> > The result from spark-sql/spark-shell should be the same as what you
> > saw on the Kylin Insights page. If the same query returns different
> > results, which should not happen, please let me know.
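
A minimal sketch of that check as a non-interactive command, assuming the
same paths as above and that bin/spark-sql sits next to bin/spark-shell;
count_rows is a hypothetical helper name:

```shell
#!/bin/sh
# Assumed locations, copied from the commands in this thread.
export HADOOP_CONF_DIR=/opt/hadoop-3.2.1/etc/hadoop
SPARK_SQL=${SPARK_SQL:-/home/kylin/apache-kylin-5.0.0-beta-bin/spark/bin/spark-sql}

# count_rows TABLE: print the row count that Spark itself sees for TABLE.
count_rows() {
  "$SPARK_SQL" --master yarn --num-executors 1 --executor-cores 1 \
    -e "select count(*) from $1"
}
```

Running `count_rows SSB.DATES` (or your own table) before and after loading
shows whether Spark itself sees the new rows.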
> >
> > Hope you can fix your problem soon.
> >
> > ------------------------
> > With warm regard
> > Xiaoxiang Yu
> >
> >
> >
> > On Wed, Nov 22, 2023 at 11:59 AM Nam Đỗ Duy <na...@vnpay.vn.invalid>
> > wrote:
> >
> > > Thank you Xiaoxiang, I tried it on my side and it worked for the ssb
> > > database, but it didn't work for my own database.
> > >
> > > It only works if I restart Kylin, so I guess some configuration is
> > > missing on my end.
> > >
> > > Thank you very much anyway; I will update you next time.
> > >
> > > Have a good day.
> > >
> > > On Fri, Nov 17, 2023 at 5:34 PM Xiaoxiang Yu <x...@apache.org> wrote:
> > >
> > > > I ran a simple test to check whether Kylin has any bug in the push-down
> > > > function, and push-down works as expected. So I'm 99% certain that
> > > > your step "I loaded the incremental data into Hive already" did not
> > > > work.
> > > >
> > > > Here are my steps (you can reproduce them in a fresh Kylin 5 docker
> > > > container in one minute):
> > > >
> > > > 1. Query `select count(*) from SSB.DATES` in project ssb without building
> > > >     any index.
> > > >     Query result(Answered By: HIVE) is :   2556
> > > >
> > > > 2. Duplicate the data file of table `ssb.dates` with the following command:
> > > >     hadoop fs -cp /user/hive/warehouse/ssb.db/dates/SSB.DATES.csv \
> > > >         /user/hive/warehouse/ssb.db/dates/SSB.DATES-2.csv
> > > >
> > > > 3. Re-query `select count(*) from SSB.DATES` in project ssb
> > > >     Query result(Answered By: HIVE) is :  5112
> > > >
> > > > So it is clear that the second query picks up the incremental data, i.e.
> > > > the Kylin query engine can find it.
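
To confirm that a load step really put new files into the table's storage
directory, a small sketch like the following may help. It only wraps
`hadoop fs -ls`; list_table_files is a hypothetical helper name, and the
directory you pass in is whatever location your own table uses:

```shell
#!/bin/sh
# Command used to reach HDFS; override HADOOP if the binary lives elsewhere.
HADOOP=${HADOOP:-hadoop}

# list_table_files DIR: show the data files under a table's storage directory.
list_table_files() {
  "$HADOOP" fs -ls "$1"
}
```

For the test above, `list_table_files /user/hive/warehouse/ssb.db/dates`
should list SSB.DATES-2.csv after step 2. If your own table's directory
shows no new file after loading, the load did not land where the table
reads from.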
> > > >
> > > > Finally, to make good use of Kylin in real use cases, good knowledge of
> > > > Apache Spark and Apache Hadoop is a must-have.
> > > >
> > > > ------------------------
> > > > With warm regard
> > > > Xiaoxiang Yu
> > > >
> > > >
> > > >
> > > > On Fri, Nov 17, 2023 at 5:52 PM Nam Đỗ Duy <na...@vnpay.vn.invalid>
> > > > wrote:
> > > >
> > > > > Have a nice weekend Xiaoxiang, and thank you for helping me become a
> > > > > Kylin fan.
> > > > >
> > > > > You are right, I am not familiar enough with Kylin and have little
> > > > > background in the Hadoop system, so I will double-check carefully
> > > > > before asking future questions. However, I did understand the
> > > > > mechanism quoted below.
> > > > >
> > > > > ============quoted====================
> > > > >
> > > > > If incremental data is not loaded into Kylin, Kylin can still answer such
> > > > > queries by reading the original hive table, but the query is not
> > > > > accelerated.
> > > > >
> > > > > If incremental data is loaded into Kylin, Kylin can answer queries by
> > > > > reading the special Index/Cuboid files, and the query will be accelerated.
> > > > >
> > > > > ============end====================
> > > > >
> > > > > To explain my previous question, these were my steps:
> > > > >
> > > > > 1. I turned off the configuration kylin.query.cache-enabled (set it to false)
> > > > > 2. I restarted Kylin
> > > > > 3. I loaded the incremental data into Hive already
> > > > > 4. I turned on the Pushdown option to query Hive instead of the model
> > > > > 5. In the Kylin Insights window, I still could not get the incremental data
> > > > >    (which was already in Hive)
> > > > >
> > > > > That was why I asked you: can I get the incremental result with the
> > > > > above 5 steps (without a model and index), or do I need to create a model,
> > > > > index and segment, and then get the incremental result by creating a new
> > > > > segment for the incremental data?
> > > > >
> > > > > Hope you get my point or I will explain more
> > > > >
> > > > > Thank you very much again
> > > > >
> > > > >
> > > > > On Fri, 17 Nov 2023 at 16:00 Xiaoxiang Yu <x...@apache.org> wrote:
> > > > >
> > > > > > Unfortunately, I think you are not asking good questions.
> > > > > > If the answer to a question can be found on the Internet,
> > > > > > it is not recommended to ask it on the mailing list. I guess you
> > > > > > don't know how Kylin works, so you need to look for documents
> > > > > > or tutorials.
> > > > > >
> > > > > > What does 'get the incremental data from Hive into Kylin' mean? Kylin
> > > > > > relies fully on Apache Spark for execution.
> > > > > >
> > > > > > If incremental data is not loaded into Kylin, Kylin can still answer such
> > > > > > queries by reading the original hive table, but the query is not
> > > > > > accelerated.
> > > > > >
> > > > > > If incremental data is loaded into Kylin, Kylin can answer queries by
> > > > > > reading the special Index/Cuboid files, and the query will be accelerated.
> > > > > >
> > > > > >
> > > > > > ------------------------
> > > > > > With warm regard
> > > > > > Xiaoxiang Yu
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 4:36 PM Nam Đỗ Duy <na...@vnpay.vn.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Xiaoxiang,
> > > > > > >
> > > > > > > Do I really need to create a model in order to get the incremental data
> > > > > > > from Hive into Kylin?
> > > > > > >
> > > > > > > Can I query the incremental data of a pure dim/fact table without a
> > > > > > > model?
> > > > > > >
> > > > > > > Thank you very much
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 9:05 AM Xiaoxiang Yu <x...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I am not really sure, but I think it is the query cache that keeps your
> > > > > > > > query result unchanged.
> > > > > > > >
> > > > > > > > The config entry is kylin.query.cache-enabled, which is turned on by
> > > > > > > > default. The doc link is
> > > > > > > > https://kylin.apache.org/5.0/docs/configuration/query_cache
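
A small sketch for checking what a properties file currently says about
this entry. The file path you pass in is an assumption (for example a
kylin.properties under your Kylin conf directory), and cache_setting is a
hypothetical helper name:

```shell
#!/bin/sh
# cache_setting FILE: print the last kylin.query.cache-enabled line in FILE, if any.
# Commented-out lines are ignored because the match is anchored at line start.
cache_setting() {
  grep -E '^[[:space:]]*kylin\.query\.cache-enabled[[:space:]]*=' "$1" | tail -n 1
}
```

No output means the entry is not set in that file, so the default (on)
applies. To turn the cache off, the line would be
kylin.query.cache-enabled=false, followed by a Kylin restart.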
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best wishes to you!
> > > > > > > > From: Xiaoxiang Yu
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > At 2023-11-17 09:48:55, "Nam Đỗ Duy" <na...@vnpay.vn.INVALID> wrote:
> > > > > > > > >Hello Team, hello Xiaoxiang, can you please help me with this urgent
> > > > > > > > >issue...
> > > > > > > > >
> > > > > > > > >(This is a public email group, so in general I leave your specific name
> > > > > > > > >out of the greeting of the first email in a thread, but in fact most of
> > > > > > > > >the time Xiaoxiang actively answers my issues, thank you very much.)
> > > > > > > > >
> > > > > > > > >On Thu, Nov 16, 2023 at 2:59 PM Nam Đỗ Duy <na...@vnpay.vn>
> > > > > > > > >wrote:
> > > > > > > > >
> > > > > > > > >> Dear Dev Team, please kindly advise on this scenario:
> > > > > > > > >>
> > > > > > > > >> 1. I have a fact table, and when I query it in the Kylin Insights window I
> > > > > > > > >> get 5 million rows.
> > > > > > > > >>
> > > > > > > > >> 2. Then I use the following command to load X rows (the last hour's data)
> > > > > > > > >> from parquet into the Hive table:
> > > > > > > > >>
> > > > > > > > >> LOAD DATA LOCAL INPATH
> > > > > > > > >> '/opt/LastHour/factUserEventDF_2023_11_16.parquet/14' INTO TABLE
> > > > > > > > >> factUserEvent;
> > > > > > > > >>
> > > > > > > > >> 3. Then I query it again in the Kylin Insights window, but it still returns
> > > > > > > > >> the previous number (5 million rows), without the X rows of last-hour data
> > > > > > > > >> that I loaded from parquet into Hive in step 2.
> > > > > > > > >>
> > > > > > > > >> Can you advise how to make the table refresh and show the updated data?
> > > > > > > > >>
> > > > > > > > >> Thank you very much
> > > > > > > > >>
