Re: How to reflect last hour data into Hive and Kylin Insights query window

Nam Đỗ Duy Tue, 28 Nov 2023 01:17:51 -0800

Thank you very much Xiaoxiang

Will try your suggestion soon


The presentation went quite well, and we decided to use Kylin in a dev
environment before using it in production.

Please continue to help us to master it

Thank you again

On Tue, 28 Nov 2023 at 16:06 Xiaoxiang Yu <[email protected]> wrote:

> Sorry for my incorrect answers before. Let me make it right.
>
> Today I tried again and reproduced the issue you reported.
> The Kylin query engine may not read new files because old metadata is
> cached and has not been invalidated.
> This is a known issue with a proper solution: call a REST API to refresh
> the metadata cache:
> https://kylin.apache.org/5.0/docs/restapi/query_api#Refresh-cached-data
>
> Here is a sample call on my side:
> curl -X PUT --user ADMIN:KYLIN \
>   -H "Content-Type: application/json;charset=utf-8" \
>   -d '{ "tables": ["DATABASE_NAME.TABLE_NAME"]}' \
>   http://localhost:7070/kylin/api/tables/single_catalog_cache
>
> It is caused by a Spark feature (introduced in 3.1.0) which tries to cache
> HDFS file lists in the Spark driver:
> https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-refresh-table.html
> Its configuration entry is spark.sql.metadataCacheTTLSeconds.
>
> ------------------------
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Wed, Nov 22, 2023 at 6:06 PM Xiaoxiang Yu <[email protected]> wrote:
>
> > It is a good question. I can share some articles with you.
> >
> > 1. How to build a metric repository by Kylin to share among data teams
> > (DA, DS, AI), is that the usage of measure in Kylin?
> >
> > I think a metric repository (or metrics store) is exactly where Kylin
> > can help. For example, Beike (ke.com) created an indicator/metrics
> > platform whose backend is Kylin. They built a metrics store on top of
> > Kylin.
> >
> > The architecture looks like this:
> > https://mmbiz.qpic.cn/mmbiz_png/9xAoGyC249Kd9icMaNT1Gs7AlDAZic7PScYNCOkSQF8PqbuSLicoxhdk4w3kJtC0bms4FzW6iby08bNiaVsUzUkBPmg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1
> >
> > Here is a technical article about it, written in Chinese (I am sorry
> > it is not translated):
> > https://mp.weixin.qq.com/s/hsGjuaYfEfParcgTimBLnw
> >
> >
> > 2. How to use Kylin for the Customer segmentation of Marketing dept?
> >
> > Here are some articles (sorry again that these are not translated):
> > https://kylin.apache.org/blog/2016/11/28/intersect-count/
> > https://zhuanlan.zhihu.com/p/100131550
> > https://cn.kyligence.io/blog/kylin-chinagreentown-user-portrait-2/
> > https://cn.kyligence.io/blog/apache-kylin-count-distinct-application-in-user-behavior-analysis/
> > https://www.infoq.cn/article/xZYe1DUopNA9CzLwau3O
> >
> > You can send your presentation material to me if you are willing to
> > share.
> >
> > ------------------------
> > With warm regard
> > Xiaoxiang Yu
> >
> >
> >
> > On Wed, Nov 22, 2023 at 5:36 PM Nam Đỗ Duy <[email protected]> wrote:
> >
> >> Thank you Xiaoxiang. My presentation about Kylin to management is
> >> tomorrow at noon, so I am putting this issue on hold to focus on the
> >> following questions. Can you please advise?
> >>
> >> 1. How to build a metric repository by Kylin to share among data teams
> >> (DA, DS, AI), is that the usage of measure in Kylin?
> >> 2. How to use Kylin for the Customer segmentation of Marketing dept?
> >>
> >>
> >> On Wed, Nov 22, 2023 at 2:10 PM Xiaoxiang Yu <[email protected]> wrote:
> >>
> >> > Before you try again, you can use spark-sql/spark-shell to check if
> >> > the data is loaded into your table successfully (or if your data is
> >> > copied to the right place).
> >> > The following shows how to start a spark-sql/spark-shell in a
> >> > container.
> >> >
> >> > export HADOOP_CONF_DIR=/opt/hadoop-3.2.1/etc/hadoop
> >> >
> >> > cd /home/kylin/apache-kylin-5.0.0-beta-bin/spark
> >> >
> >> > bin/spark-shell --executor-cores 1 --num-executors 1 --master yarn
> >> >
> >> >
> >> > The result of spark-sql/spark-shell should be the same as what you
> >> > saw in the Kylin Insight page. If there are different results for the
> >> > same query, which should not happen, please let me know.
> >> >
> >> > Hope you can fix your problem soon.
> >> >
> >> > ------------------------
> >> > With warm regard
> >> > Xiaoxiang Yu
> >> >
> >> >
> >> >
> >> > On Wed, Nov 22, 2023 at 11:59 AM Nam Đỗ Duy <[email protected]>
> >> > wrote:
> >> >
> >> > > Thank you Xiaoxiang, I tried it on my side and it worked for the ssb
> >> > > database, but it didn't work for my own database.
> >> > >
> >> > > It only works if I restart Kylin, so I guess there might be some
> >> > > missing configuration on my end.
> >> > >
> >> > > Thank you very much anyway; I will update you next time.
> >> > >
> >> > > Have a good day.
> >> > >
> >> > > On Fri, Nov 17, 2023 at 5:34 PM Xiaoxiang Yu <[email protected]> wrote:
> >> > >
> >> > > > I did a simple test to check whether Kylin has any bugs in the
> >> > > > push-down function, and the push-down function works as expected.
> >> > > > So I'm 99% certain that your step "I loaded the incremental data
> >> > > > into Hive already" did not work.
> >> > > >
> >> > > > Here are my steps (you can reproduce them in a fresh Kylin 5 Docker
> >> > > > container in one minute):
> >> > > >
> >> > > > 1. Query `select count(*) from SSB.DATES` in project ssb without
> >> > > >     building any index.
> >> > > >     Query result (Answered By: HIVE) is: 2556
> >> > > >
> >> > > > 2. Duplicate the file of table `ssb.dates` with the following command:
> >> > > >     hadoop fs -cp /user/hive/warehouse/ssb.db/dates/SSB.DATES.csv \
> >> > > >       /user/hive/warehouse/ssb.db/dates/SSB.DATES-2.csv
> >> > > >
> >> > > > 3. Re-query `select count(*) from SSB.DATES` in project ssb.
> >> > > >     Query result (Answered By: HIVE) is: 5112
> >> > > >
> >> > > > So it is clear that the incremental data is found by the Kylin
> >> > > > query engine in the second query.
> >> > > >
> >> > > > Finally, to make good use of Kylin in real use cases, good knowledge
> >> > > > of Apache Spark and Apache Hadoop is a must-have.
> >> > > >
> >> > > > ------------------------
> >> > > > With warm regard
> >> > > > Xiaoxiang Yu
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Fri, Nov 17, 2023 at 5:52 PM Nam Đỗ Duy <[email protected]> wrote:
> >> > > >
> >> > > > > Have a nice weekend Xiaoxiang, and thank you for helping me
> >> > > > > become a Kylin fan.
> >> > > > >
> >> > > > > You are right, I am not familiar enough with Kylin and have little
> >> > > > > background in the Hadoop system, so I will double-check carefully
> >> > > > > before asking future questions. However, I did understand the
> >> > > > > following mechanism, quoted below.
> >> > > > >
> >> > > > > ============quoted====================
> >> > > > >
> >> > > > > If incremental data is not loaded into Kylin, Kylin can still
> >> > > > > answer such queries by reading the original hive table, but the
> >> > > > > query is not accelerated.
> >> > > > >
> >> > > > > If incremental data is loaded into Kylin, Kylin can answer queries
> >> > > > > by reading the special Index/Cuboid files, and the query will be
> >> > > > > accelerated.
> >> > > > >
> >> > > > > ============end====================
> >> > > > >
> >> > > > > To explain my previous question, the steps were as follows:
> >> > > > >
> >> > > > > 1. I turned off the configuration kylin.query.cache-enabled (set
> >> > > > >    to false)
> >> > > > > 2. I restarted Kylin
> >> > > > > 3. I loaded the incremental data into Hive already
> >> > > > > 4. I turned on the Pushdown option to query Hive, not the model
> >> > > > > 5. In the Kylin Insights window, I still could not get the
> >> > > > >    incremental data (which was already in Hive)
> >> > > > >
> >> > > > > That was the reason why I asked you: can I get the incremental
> >> > > > > result with the 5 steps above (without a model and index), or do I
> >> > > > > need to create a model, index, and segment, and then get the
> >> > > > > incremental result by creating a new segment for the incremental
> >> > > > > data?
> >> > > > >
> >> > > > > I hope you get my point; otherwise I will explain more.
> >> > > > >
> >> > > > > Thank you very much again
> >> > > > >
> >> > > > >
> >> > > > > On Fri, 17 Nov 2023 at 16:00 Xiaoxiang Yu <[email protected]> wrote:
> >> > > > >
> >> > > > > > Unfortunately, I guess you are not asking good questions.
> >> > > > > > If the answer to a question can be found on the Internet,
> >> > > > > > it is not recommended to ask it on the mailing list. I guess
> >> > > > > > you don't know how Kylin works, so you need to search for
> >> > > > > > documents or some tutorials.
> >> > > > > >
> >> > > > > > What does 'get the incremental data from Hive into Kylin' mean?
> >> > > > > > Kylin fully relies on Apache Spark for execution.
> >> > > > > >
> >> > > > > > If incremental data is not loaded into Kylin, Kylin can still
> >> > > > > > answer such queries by reading the original hive table, but the
> >> > > > > > query is not accelerated.
> >> > > > > >
> >> > > > > > If incremental data is loaded into Kylin, Kylin can answer queries
> >> > > > > > by reading the special Index/Cuboid files, and the query will be
> >> > > > > > accelerated.
> >> > > > > >
> >> > > > > >
> >> > > > > > ------------------------
> >> > > > > > With warm regard
> >> > > > > > Xiaoxiang Yu
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Fri, Nov 17, 2023 at 4:36 PM Nam Đỗ Duy <[email protected]> wrote:
> >> > > > > >
> >> > > > > > > Hi Xiaoxiang,
> >> > > > > > >
> >> > > > > > > Do I really need to create a model in order to get the
> >> > > > > > > incremental data from Hive into Kylin?
> >> > > > > > >
> >> > > > > > > Can I query the incremental data of a pure dim/fact table
> >> > > > > > > without a model?
> >> > > > > > >
> >> > > > > > > Thank you very much
> >> > > > > > >
> >> > > > > > > On Fri, Nov 17, 2023 at 9:05 AM Xiaoxiang Yu <[email protected]> wrote:
> >> > > > > > >
> >> > > > > > > > I am not really sure, but I think it is the query cache that
> >> > > > > > > > keeps your query result unchanged.
> >> > > > > > > >
> >> > > > > > > > The config entry is kylin.query.cache-enabled, which is turned
> >> > > > > > > > on by default. The doc link is
> >> > > > > > > > https://kylin.apache.org/5.0/docs/configuration/query_cache
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > >
> >> > > > > > > > Best wishes to you !
> >> > > > > > > > From ：Xiaoxiang Yu
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > At 2023-11-17 09:48:55, "Nam Đỗ Duy" <[email protected]> wrote:
> >> > > > > > > > >Hello Team, hello Xiaoxiang, can you please help me with this
> >> > > > > > > > >urgent issue...
> >> > > > > > > > >
> >> > > > > > > > >(this is a public email group, so in general I leave your
> >> > > > > > > > >specific name out of the greeting of the first email in a
> >> > > > > > > > >thread, but in fact most of the time Xiaoxiang actively
> >> > > > > > > > >answers my issues, thank you very much)
> >> > > > > > > > >
> >> > > > > > > > >On Thu, Nov 16, 2023 at 2:59 PM Nam Đỗ Duy <[email protected]> wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Dear Dev Team, please kindly advise on this scenario
> >> > > > > > > > >>
> >> > > > > > > > >> 1. I have a fact table, and I use the Kylin Insights window to
> >> > > > > > > > >> query it and get 5 million rows.
> >> > > > > > > > >>
> >> > > > > > > > >> 2. Then I use the following command to load X rows (last hour
> >> > > > > > > > >> data) from parquet into the Hive table
> >> > > > > > > > >>
> >> > > > > > > > >> LOAD DATA LOCAL INPATH
> >> > > > > > > > >> '/opt/LastHour/factUserEventDF_2023_11_16.parquet/14'
> >> > > > > > > > >> INTO TABLE factUserEvent;
> >> > > > > > > > >>
> >> > > > > > > > >> 3. Then I open the Kylin Insights window to query it, but it
> >> > > > > > > > >> still returns the previous number (5 million rows), without
> >> > > > > > > > >> the X rows of last hour data which I loaded from parquet into
> >> > > > > > > > >> Hive in step 2.
> >> > > > > > > > >>
> >> > > > > > > > >> Can you advise how to make the table refresh and update?
> >> > > > > > > > >>
> >> > > > > > > > >> Thank you very much
> >> > > > > > > > >>
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
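
For anyone landing on this thread later: the cache refresh call quoted
above could be wrapped in a small shell helper so that it can run right
after each hourly load. This is only a sketch, not an official Kylin
script. The endpoint, HTTP method, header, and default credentials are
taken from the sample call in the thread; the function names and the
KYLIN_URL, KYLIN_USER, and KYLIN_PASSWORD variables are hypothetical
names introduced here for illustration.

```shell
# Build the JSON body {"tables": ["DB.TABLE", ...]} from the arguments.
# Assumes table names contain no characters that need JSON escaping.
build_payload() {
  payload='{"tables": ['
  sep=''
  for table in "$@"; do
    payload="${payload}${sep}\"${table}\""
    sep=', '
  done
  printf '%s]}\n' "$payload"
}

# Ask Kylin to refresh its cached catalog metadata for the given tables.
# KYLIN_URL, KYLIN_USER and KYLIN_PASSWORD are hypothetical variables;
# the defaults mirror the sample call quoted in the thread.
refresh_catalog_cache() {
  curl -X PUT \
    --user "${KYLIN_USER:-ADMIN}:${KYLIN_PASSWORD:-KYLIN}" \
    -H "Content-Type: application/json;charset=utf-8" \
    -d "$(build_payload "$@")" \
    "${KYLIN_URL:-http://localhost:7070}/kylin/api/tables/single_catalog_cache"
}
```

After sourcing the file, a call such as
`refresh_catalog_cache DATABASE_NAME.TABLE_NAME` would send the same
request as the sample above. Running it after each `LOAD DATA` step
should let push-down queries see the new files without restarting Kylin,
assuming the stale metadata cache is the cause, as described in the
thread.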
