Re: How to reflect last hour data into Hive and Kylin Insights query window

Xiaoxiang Yu Tue, 28 Nov 2023 01:06:15 -0800

Sorry for my incorrect answers before. Let me make it right.

Today I tried again and reproduced the issue you reported.
The Kylin query engine may not read new files because old metadata is
cached and not invalidated.
This is a known issue with a proper solution: call a REST API to refresh
the metadata cache:
https://kylin.apache.org/5.0/docs/restapi/query_api#Refresh-cached-data


Here is a sample call from my side:
curl -X PUT --user ADMIN:KYLIN \
  -H "Content-Type: application/json;charset=utf-8" \
  -d '{ "tables": ["DATABASE_NAME.TABLE_NAME"] }' \
  http://localhost:7070/kylin/api/tables/single_catalog_cache
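
If the refresh has to run after every hourly load, the call above can be
wrapped in a small shell helper. This is only a sketch under assumptions:
the function names build_payload and refresh_catalog_cache and the
KYLIN_URL, KYLIN_USER and KYLIN_PASSWORD variables are hypothetical names
chosen for illustration, not part of Kylin. Only the endpoint, the payload
shape and the default credentials are taken from the sample call.

```shell
# Build the JSON body {"tables": [...]} from the table names given as arguments.
# Assumption: names are plain DATABASE.TABLE identifiers, so no JSON escaping is done.
build_payload() {
  local json="" t
  for t in "$@"; do
    # Append each name as a quoted string, comma-separated after the first one.
    json="${json:+$json, }\"$t\""
  done
  printf '{ "tables": [%s] }' "$json"
}

# Send the refresh request for one or more tables.
# KYLIN_URL, KYLIN_USER and KYLIN_PASSWORD are hypothetical overrides;
# the defaults mirror the sample call.
refresh_catalog_cache() {
  curl -X PUT --user "${KYLIN_USER:-ADMIN}:${KYLIN_PASSWORD:-KYLIN}" \
    -H "Content-Type: application/json;charset=utf-8" \
    -d "$(build_payload "$@")" \
    "${KYLIN_URL:-http://localhost:7070}/kylin/api/tables/single_catalog_cache"
}
```

For example, running `refresh_catalog_cache DATABASE_NAME.TABLE_NAME` right
after the Hive load finishes should make the next push-down query see the
new files.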

The stale metadata is caused by a Spark feature (introduced in 3.1.0) that
caches HDFS file listings in the Spark driver (see
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-refresh-table.html).
Its configuration entry is spark.sql.metadataCacheTTLSeconds.

------------------------
With warm regard
Xiaoxiang Yu



On Wed, Nov 22, 2023 at 6:06 PM Xiaoxiang Yu <x...@apache.org> wrote:

> It is a good question, I can share some articles with you.
>
> 1. How to build a metric repository by Kylin to share among data teams (DA,
> DS, AI), is that the usage of measure in Kylin?
>
> I think a metric repository (or metrics store) is exactly where Kylin can
> help. For example, Beike (ke.com) created an indicator/metrics platform
> whose backend is Kylin. They built a metrics store on top of Kylin.
>
> The architecture looks like this
> https://mmbiz.qpic.cn/mmbiz_png/9xAoGyC249Kd9icMaNT1Gs7AlDAZic7PScYNCOkSQF8PqbuSLicoxhdk4w3kJtC0bms4FzW6iby08bNiaVsUzUkBPmg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1
>
>
> Here is a technical article about it, written in Chinese (I am sorry it
> is not translated):
>  https://mp.weixin.qq.com/s/hsGjuaYfEfParcgTimBLnw
>
>
> 2. How to use Kylin for the Customer segmentation of Marketing dept?
>
> Here are some articles (sorry again, these are not translated):
> https://kylin.apache.org/blog/2016/11/28/intersect-count/
> https://zhuanlan.zhihu.com/p/100131550
> https://cn.kyligence.io/blog/kylin-chinagreentown-user-portrait-2/
>
> https://cn.kyligence.io/blog/apache-kylin-count-distinct-application-in-user-behavior-analysis/
> https://www.infoq.cn/article/xZYe1DUopNA9CzLwau3O
>
> You can send your presentation material to me if you are willing to share.
>
> ------------------------
> With warm regard
> Xiaoxiang Yu
>
>
>
> On Wed, Nov 22, 2023 at 5:36 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>
>> Thank you Xiaoxiang. Tomorrow noon is my presentation to the management
>> about Kylin, so I am putting this issue on hold to focus on the following
>> questions. Can you please advise:
>>
>> 1. How to build a metric repository by Kylin to share among data teams
>> (DA, DS, AI), is that the usage of measure in Kylin?
>> 2. How to use Kylin for the Customer segmentation of Marketing dept?
>>
>>
>> On Wed, Nov 22, 2023 at 2:10 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>
>> > Before you try again, you can use spark-sql/spark-shell to check whether
>> > the data was loaded into your table successfully (or whether your data
>> > was copied to the right place).
>> > The following shows how to start spark-sql/spark-shell in a container.
>> >
>> > export HADOOP_CONF_DIR=/opt/hadoop-3.2.1/etc/hadoop
>> >
>> > cd /home/kylin/apache-kylin-5.0.0-beta-bin/spark
>> >
>> > bin/spark-shell --executor-cores 1 --num-executors 1 --master yarn
>> >
>> >
>> > The result from spark-sql/spark-shell should be the same as what you
>> > saw on the Kylin Insight page. If the same query returns different
>> > results, which should not happen, please let me know.
>> >
>> > Hope you can fix your problem soon.
>> >
>> > ------------------------
>> > With warm regard
>> > Xiaoxiang Yu
>> >
>> >
>> >
>> > On Wed, Nov 22, 2023 at 11:59 AM Nam Đỗ Duy <na...@vnpay.vn.invalid>
>> > wrote:
>> >
>> > > Thank you Xiaoxiang, I tried on my side and it worked for the ssb
>> > > database but it didn't work for my own database.
>> > >
>> > > It only works if I restart Kylin, so I guess some configuration is
>> > > missing on my end.
>> > >
>> > > Thank you very much anyway and will update next time.
>> > >
>> > > Have a good day.
>> > >
>> > > On Fri, Nov 17, 2023 at 5:34 PM Xiaoxiang Yu <x...@apache.org> wrote:
>> > >
>> > > > I did a quick test to check whether Kylin has any bug in the
>> > > > push-down function, and it works as expected. So I'm 99% certain
>> > > > that your step "I loaded the incremental data into Hive already"
>> > > > did not work.
>> > > >
>> > > > Here are my steps (you can reproduce them in a fresh Kylin 5 Docker
>> > > > container in one minute):
>> > > >
>> > > > 1. Query `select count(*) from SSB.DATES` in project ssb without
>> > > >    building any index.
>> > > >     Query result (Answered By: HIVE) is: 2556
>> > > >
>> > > > 2. Duplicate the file of table `ssb.dates` with the following command:
>> > > >     hadoop fs -cp /user/hive/warehouse/ssb.db/dates/SSB.DATES.csv \
>> > > >       /user/hive/warehouse/ssb.db/dates/SSB.DATES-2.csv
>> > > >
>> > > > 3. Re-query `select count(*) from SSB.DATES` in project ssb.
>> > > >     Query result (Answered By: HIVE) is: 5112
>> > > >
>> > > > So it is clear that the Kylin query engine finds the incremental
>> > > > data on the second query.
>> > > >
>> > > > Finally, to make good use of Kylin in real use cases, good
>> > > > knowledge of Apache Spark and Apache Hadoop is a must-have.
>> > > >
>> > > > ------------------------
>> > > > With warm regard
>> > > > Xiaoxiang Yu
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Nov 17, 2023 at 5:52 PM Nam Đỗ Duy <na...@vnpay.vn.invalid>
>> > > wrote:
>> > > >
>> > > > > Have a nice weekend Xiaoxiang, and thank you for helping me
>> > > > > become a Kylin fan.
>> > > > >
>> > > > > You are right, I am not familiar enough with Kylin and have little
>> > > > > background in the Hadoop system, so I will double-check carefully
>> > > > > before asking future questions. However, I did understand the
>> > > > > following mechanism in quotes.
>> > > > >
>> > > > > ============quoted====================
>> > > > >
>> > > > > If incremental data is not loaded into Kylin, Kylin can still
>> > > > > answer such queries by reading the original Hive table, but the
>> > > > > query is not accelerated.
>> > > > >
>> > > > > If incremental data is loaded into Kylin, Kylin can answer
>> > > > > queries by reading the special Index/Cuboid files, and the query
>> > > > > will be accelerated.
>> > > > >
>> > > > > ============end====================
>> > > > >
>> > > > > To explain my previous question, the steps were as follows:
>> > > > >
>> > > > > 1. I turned off the configuration kylin.query.cache-enabled (set
>> > > > >    to false)
>> > > > > 2. Restart Kylin
>> > > > > 3. I loaded the incremental data into Hive already
>> > > > > 4. Turn on the Pushdown option to query Hive, not the model
>> > > > > 5. In the Kylin Insights window, I still cannot get the
>> > > > >    incremental data (which is already in Hive)
>> > > > >
>> > > > > That is why I asked you: can I get the incremental result with
>> > > > > the above 5 steps (without a model and index), or do I need to
>> > > > > create a model, index and segment, and then get the incremental
>> > > > > result by creating a new segment for the incremental data?
>> > > > >
>> > > > > Hope you get my point or I will explain more
>> > > > >
>> > > > > Thank you very much again
>> > > > >
>> > > > >
>> > > > > On Fri, 17 Nov 2023 at 16:00 Xiaoxiang Yu <x...@apache.org>
>> wrote:
>> > > > >
>> > > > > > Unfortunately, I guess you are not asking good questions.
>> > > > > > If the answer of a question can be searched on the Internet,
>> > > > > > it is not recommended to ask it in the mailing list. I guess you
>> > > > > > didn't know how Kylin works, so you need to search for documents
>> > > > > >  or some tutorials.
>> > > > > >
>> > > > > > What does 'get the incremental data from Hive into Kylin' mean?
>> > > > > > Kylin fully relies on Apache Spark for execution.
>> > > > > >
>> > > > > > If incremental data is not loaded into Kylin, Kylin can still
>> > > > > > answer such queries by reading the original Hive table, but the
>> > > > > > query is not accelerated.
>> > > > > >
>> > > > > > If incremental data is loaded into Kylin, Kylin can answer
>> > > > > > queries by reading the special Index/Cuboid files, and the query
>> > > > > > will be accelerated.
>> > > > > >
>> > > > > >
>> > > > > > ------------------------
>> > > > > > With warm regard
>> > > > > > Xiaoxiang Yu
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Fri, Nov 17, 2023 at 4:36 PM Nam Đỗ Duy
>> <na...@vnpay.vn.invalid
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi Xiaoxiang,
>> > > > > > >
>> > > > > > > Do I really need to create a model in order to get the
>> > > > > > > incremental data from Hive into Kylin?
>> > > > > > >
>> > > > > > > Can I query the incremental data of a pure dim/fact table
>> > > > > > > without a model?
>> > > > > > >
>> > > > > > > Thank you very much
>> > > > > > >
>> > > > > > > On Fri, Nov 17, 2023 at 9:05 AM Xiaoxiang Yu <x...@apache.org
>> >
>> > > > wrote:
>> > > > > > >
>> > > > > > > > I am not really sure, but I think it is the query cache that
>> > > > > > > > keeps your query result unchanged.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > The config entry is kylin.query.cache-enabled, which is turned
>> > > > > > > > on by default.
>> > > > > > > > The doc link is
>> > > > > > > > https://kylin.apache.org/5.0/docs/configuration/query_cache
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > >
>> > > > > > > > Best wishes to you !
>> > > > > > > > From ：Xiaoxiang Yu
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > At 2023-11-17 09:48:55, "Nam Đỗ Duy" <na...@vnpay.vn.INVALID
>> >
>> > > > wrote:
>> > > > > > > > >Hello Team, hello Xiaoxiang, can you please help me with
>> > > > > > > > >this urgent issue...
>> > > > > > > > >
>> > > > > > > > >(this is a public email group, so in general I leave your
>> > > > > > > > >specific name out of the greeting of the first email in a
>> > > > > > > > >thread, but in fact most of the time Xiaoxiang actively
>> > > > > > > > >answers my issues, thank you very much)
>> > > > > > > > >
>> > > > > > > > >On Thu, Nov 16, 2023 at 2:59 PM Nam Đỗ Duy <na...@vnpay.vn
>> >
>> > > > wrote:
>> > > > > > > > >
>> > > > > > > > >> Dear Dev Team, please kindly advise on this scenario:
>> > > > > > > > >>
>> > > > > > > > >> 1. I have a fact table; when I query it in the Kylin
>> > > > > > > > >> Insights window I get 5 million rows.
>> > > > > > > > >>
>> > > > > > > > >> 2. Then I use the following command to load X rows (last
>> > > > > > > > >> hour data) from parquet into the Hive table:
>> > > > > > > > >>
>> > > > > > > > >> LOAD DATA LOCAL INPATH
>> > > > > > > > >> '/opt/LastHour/factUserEventDF_2023_11_16.parquet/14'
>> > > > > > > > >> INTO TABLE factUserEvent;
>> > > > > > > > >>
>> > > > > > > > >> 3. Then I open the Kylin Insights window to query it, but
>> > > > > > > > >> it still returns the previous number (5 million rows),
>> > > > > > > > >> without the X rows of last-hour data that I loaded from
>> > > > > > > > >> parquet into Hive in step 2.
>> > > > > > > > >>
>> > > > > > > > >> Can you advise how to make the table refresh and update?
>> > > > > > > > >>
>> > > > > > > > >> Thank you very much
>> > > > > > > > >>
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
