Sorry for my incorrect answers earlier. Let me make it right.

Today I tried again and reproduced the issue you reported. The Kylin query engine may not read new files because old metadata is cached and not invalidated. This is a known issue with a proper solution: call a REST API to refresh the metadata cache:
https://kylin.apache.org/5.0/docs/restapi/query_api#Refresh-cached-data
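
If the refresh has to run after every data load, it may be convenient to wrap the request in a small shell helper. The following is only a sketch: the function names and the KYLIN_URL / KYLIN_AUTH variables are made up for illustration, while the endpoint path, the default credentials and the payload shape are taken from the sample call in this message.

```shell
# Hypothetical helper, not part of Kylin. Defaults mirror the sample call in this thread.
KYLIN_URL="${KYLIN_URL:-http://localhost:7070}"
KYLIN_AUTH="${KYLIN_AUTH:-ADMIN:KYLIN}"

# Build the JSON body {"tables": ["DB.TABLE", ...]} from the given table names.
# Names are inserted as-is, so they must not contain quotes or backslashes.
build_payload() {
  local json="" t
  for t in "$@"; do
    json="${json:+$json, }\"$t\""
  done
  printf '{"tables": [%s]}' "$json"
}

# Send the refresh request for one or more tables (assumes curl is installed).
refresh_table_cache() {
  curl -sS -X PUT --user "$KYLIN_AUTH" \
    -H "Content-Type: application/json;charset=utf-8" \
    -d "$(build_payload "$@")" \
    "$KYLIN_URL/kylin/api/tables/single_catalog_cache"
}
```

After loading new files into a table, one would then call, for example, `refresh_table_cache DATABASE_NAME.TABLE_NAME` before querying again.
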
Here is a sample call from my side:

    curl -X PUT --user ADMIN:KYLIN -H "Content-Type: application/json;charset=utf-8" -d '{ "tables": ["DATABASE_NAME.TABLE_NAME"]}' http://localhost:7070/kylin/api/tables/single_catalog_cache

The issue is caused by a Spark feature (introduced in 3.1.0) that caches HDFS file listings in the Spark driver (https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-refresh-table.html). Its configuration entry is spark.sql.metadataCacheTTLSeconds.

------------------------
With warm regard
Xiaoxiang Yu

On Wed, Nov 22, 2023 at 6:06 PM Xiaoxiang Yu <x...@apache.org> wrote:

> It is a good question, I can share some articles with you.
>
> 1. How to build a metric repository by Kylin to share among data teams (DA,
> DS, AI), is that the usage of measure in Kylin?
>
> I think the metric repository(or metrics store) is actually which Kylin
> can help. For example,
> Beike(ke.com) did create an indicator/metrics platform whose backend is
> Kylin. They created a metrics
> store on the top of Kylin.
>
> The architecture looks like this
> https://mmbiz.qpic.cn/mmbiz_png/9xAoGyC249Kd9icMaNT1Gs7AlDAZic7PScYNCOkSQF8PqbuSLicoxhdk4w3kJtC0bms4FzW6iby08bNiaVsUzUkBPmg/640?wx_fmt=png&wxfrom=5&wx_lazy=1&wx_co=1
>
> Here is technical article which wrote in Chinese about it(I am sorry this
> is not translated):
> https://mp.weixin.qq.com/s/hsGjuaYfEfParcgTimBLnw
>
> 2. How to use Kylin for the Customer segmentation of Marketing dept?
>
> Here are some articles : (sorry again for these are not translated)
> https://kylin.apache.org/blog/2016/11/28/intersect-count/
> https://zhuanlan.zhihu.com/p/100131550
> https://cn.kyligence.io/blog/kylin-chinagreentown-user-portrait-2/
> https://cn.kyligence.io/blog/apache-kylin-count-distinct-application-in-user-behavior-analysis/
> https://www.infoq.cn/article/xZYe1DUopNA9CzLwau3O
>
> You can send your presentation material to me if you are willing to share.
>
> ------------------------
> With warm regard
> Xiaoxiang Yu
>
> On Wed, Nov 22, 2023 at 5:36 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>
>> Thank you Xiaoxiang, tomorrow noon is my presentation to the management
>> about kylin so I am pending this issue to focus on following ones, can you
>> please advise:
>>
>> 1. How to build a metric repository by Kylin to share among data teams
>> (DA, DS, AI), is that the usage of measure in Kylin?
>> 2. How to use Kylin for the Customer segmentation of Marketing dept?
>>
>> On Wed, Nov 22, 2023 at 2:10 PM Xiaoxiang Yu <x...@apache.org> wrote:
>>
>> > Before you try again, you can use spark-sql/spark-shell to check if the
>> > data is loaded into your table successfully (or if your data is copied
>> > to the right place).
>> > Following is how to start a spark-sql/spark-shell in a container.
>> >
>> > export HADOOP_CONF_DIR=/opt/hadoop-3.2.1/etc/hadoop
>> >
>> > cd /home/kylin/apache-kylin-5.0.0-beta-bin/spark
>> >
>> > bin/spark-shell --executor-cores 1 --num-executors 1 --master yarn
>> >
>> > The result of spark-sql/spark-shell should be the same as your
>> > saw in Kylin insight page. If there are different results for the same
>> > query, which should not happen, please let me know.
>> >
>> > Hope you can fix your problem soon.
>> >
>> > ------------------------
>> > With warm regard
>> > Xiaoxiang Yu
>> >
>> > On Wed, Nov 22, 2023 at 11:59 AM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>> >
>> > > Thank you Xiaoxiang, I tried in my place and it worked for the ssb database
>> > > but it didn't work for my own database.
>> > >
>> > > It only works if I restart kylin so I guess there might be some
>> > > configuration miss in my end.
>> > >
>> > > Thank you very much anyway and will update next time.
>> > >
>> > > Have a good day.
>> > >
>> > > On Fri, Nov 17, 2023 at 5:34 PM Xiaoxiang Yu <x...@apache.org> wrote:
>> > >
>> > > > I did an easy test to verify if kylin has any bugs for the push down
>> > > > function. And the push down function works as expected without any
>> > > > mistakes. So I'm 99% certain that your step "I loaded the incremental
>> > > > data into Hive already" does not work.
>> > > >
>> > > > Here are my steps(you can reproduce in a fresh Kylin5 docker container
>> > > > in one minute) :
>> > > >
>> > > > 1. Query `select count(*) from SSB.DATES` in project ssb without
>> > > > building any index.
>> > > > Query result(Answered By: HIVE) is : 2556
>> > > >
>> > > > 2. Duplicate the file of table `ssb.dates` by following command:
>> > > > hadoop fs -cp /user/hive/warehouse/ssb.db/dates/SSB.DATES.csv /user/hive/warehouse/ssb.db/dates/SSB.DATES-2.csv
>> > > >
>> > > > 3. Re-query `select count(*) from SSB.DATES` in project ssb
>> > > > Query result(Answered By: HIVE) is : 5112
>> > > >
>> > > > So, it is clear that the second query incremental data can be found by
>> > > > the Kylin query engine.
>> > > >
>> > > > Finally, to make good use of Kylin in real use cases, good knowledge of
>> > > > Apache Spark and Apache Hadoop is a must-to-have.
>> > > >
>> > > > ------------------------
>> > > > With warm regard
>> > > > Xiaoxiang Yu
>> > > >
>> > > > On Fri, Nov 17, 2023 at 5:52 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>> > > >
>> > > > > Have a nice weekend Xiaoxiang, and thank you for helping me to become a
>> > > > > kylin's fan
>> > > > >
>> > > > > You are right I am not familiar with Kylin enough and have little
>> > > > > background of the hadoop system so I will double check here carefully
>> > > > > before future questions. However I did understand the following mechanism
>> > > > > in quotes.
>> > > > >
>> > > > > ============quoted====================
>> > > > >
>> > > > > If incremental data is not loaded into Kylin, Kylin can still answer such
>> > > > > queries by reading the original hive table, but the query is not accelerated.
>> > > > >
>> > > > > If incremental data is loaded into Kylin, Kylin can answer queries by
>> > > > > reading the special Index/Cuboid files, and the query will be accelerated.
>> > > > >
>> > > > > ============end====================
>> > > > >
>> > > > > I explain my previous question that was as follows:
>> > > > >
>> > > > > 1. I turned off this configuration kylin.query.cache-enabled (set = false)
>> > > > > 2. Restart Kylin
>> > > > > 3. I loaded the incremental data into Hive already
>> > > > > 4. Turn on Pushdown option to query Hive not model
>> > > > > 5. In Kylin Insights window, I still cannot get the incremental data (which
>> > > > > has been in Hive already)
>> > > > >
>> > > > > That was the reason why I asked you: can I get the incremental result by
>> > > > > above 5 steps (without model and index) or do I need to create model and
>> > > > > index and segment then I can get the incremental result by creating a new
>> > > > > segment according to incremental data?
>> > > > >
>> > > > > Hope you get my point or I will explain more
>> > > > >
>> > > > > Thank you very much again
>> > > > >
>> > > > > On Fri, 17 Nov 2023 at 16:00 Xiaoxiang Yu <x...@apache.org> wrote:
>> > > > >
>> > > > > > Unfortunately, I guess you are not asking good questions.
>> > > > > > If the answer of a question can be searched on the Internet,
>> > > > > > it is not recommended to ask it in the mailing list. I guess you
>> > > > > > didn't know how Kylin works, so you need to search for documents
>> > > > > > or some tutorials.
>> > > > > >
>> > > > > > What does 'get the incremental data from Hive into Kylin' means?
>> > > > > > Kylin fully relies on Apache Spark for execution.
>> > > > > >
>> > > > > > If incremental data is not loaded into Kylin, Kylin can still answer such
>> > > > > > queries by reading the original hive table, but the query is not accelerated.
>> > > > > >
>> > > > > > If incremental data is loaded into Kylin, Kylin can answer queries by
>> > > > > > reading the special Index/Cuboid files, and the query will be accelerated.
>> > > > > >
>> > > > > > ------------------------
>> > > > > > With warm regard
>> > > > > > Xiaoxiang Yu
>> > > > > >
>> > > > > > On Fri, Nov 17, 2023 at 4:36 PM Nam Đỗ Duy <na...@vnpay.vn.invalid> wrote:
>> > > > > >
>> > > > > > > Hi Xiaoxiang,
>> > > > > > >
>> > > > > > > Do I really need to create a model in order to get the incremental data
>> > > > > > > from Hive into Kylin?
>> > > > > > >
>> > > > > > > Can I query the incremental data of a pure dim/fact table without a model?
>> > > > > > >
>> > > > > > > Thank you very much
>> > > > > > >
>> > > > > > > On Fri, Nov 17, 2023 at 9:05 AM Xiaoxiang Yu <x...@apache.org> wrote:
>> > > > > > >
>> > > > > > > > I am not really sure. But I think it is the Query cache make your query
>> > > > > > > > result unchanged.
>> > > > > > > >
>> > > > > > > > The config entry is kylin.query.cache-enabled , is turn on by default.
>> > > > > > > > This doc links is
>> > > > > > > > https://kylin.apache.org/5.0/docs/configuration/query_cache
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > >
>> > > > > > > > Best wishes to you !
>> > > > > > > > From :Xiaoxiang Yu
>> > > > > > > >
>> > > > > > > > At 2023-11-17 09:48:55, "Nam Đỗ Duy" <na...@vnpay.vn.INVALID> wrote:
>> > > > > > > > >Hello Team, hello Xiaoxiang, can you please help me with this urgent
>> > > > > > > > >issue...
>> > > > > > > > >
>> > > > > > > > >(this is public email group so in general I neglect your specific name
>> > > > > > > > >from greeting of first email in the threads, but in fact most of time
>> > > > > > > > >Xiaoxiang actively answers my issues, thank you very much)
>> > > > > > > > >
>> > > > > > > > >On Thu, Nov 16, 2023 at 2:59 PM Nam Đỗ Duy <na...@vnpay.vn> wrote:
>> > > > > > > > >
>> > > > > > > > >> Dear Dev Team, please kindly advise this scenario
>> > > > > > > > >>
>> > > > > > > > >> 1. I have a fact table and I use Kylin insights window to query it and get
>> > > > > > > > >> 5 million rows.
>> > > > > > > > >>
>> > > > > > > > >> 2. Then I use following command to load X rows (last hour data) from
>> > > > > > > > >> parquet into Hive table
>> > > > > > > > >>
>> > > > > > > > >> LOAD DATA LOCAL INPATH
>> > > > > > > > >> '/opt/LastHour/factUserEventDF_2023_11_16.parquet/14' INTO TABLE
>> > > > > > > > >> factUserEvent;
>> > > > > > > > >>
>> > > > > > > > >> 3. Then I open Kylin insights window to query it but it still returned
>> > > > > > > > >> previous number (5 million rows) not adding the last hour data of X rows
>> > > > > > > > >> which I previously loaded from parquet into hive in step 2)
>> > > > > > > > >>
>> > > > > > > > >> Can you advise the way to make table refresh and updated?
>> > > > > > > > >>
>> > > > > > > > >> Thank you very much