That depends on your USER_ID cardinality. I think your USER_ID has duplicated values between segments; that's why you use count **distinct**. If the USER_ID is always different and shows up only once, a plain count should be fine, no need for count **distinct**.
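To make the distinction concrete, here is a minimal Java sketch with toy data (the segment contents are invented for illustration): summing per-segment row counts double-counts a USER_ID that appears in more than one segment, while a distinct count over the union does not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctVsCount {
    // Two hypothetical daily segments; "u2" and "u3" appear in both.
    static final List<String> SEG1 = Arrays.asList("u1", "u2", "u3");
    static final List<String> SEG2 = Arrays.asList("u2", "u3", "u4");

    // Plain count: per-segment sizes summed; overlapping users counted twice.
    static int plainCount() {
        return SEG1.size() + SEG2.size();
    }

    // Count distinct: size of the union of USER_IDs across segments.
    static int distinctCount() {
        Set<String> union = new HashSet<>(SEG1);
        union.addAll(SEG2);
        return union.size();
    }

    public static void main(String[] args) {
        System.out.println("plain count:    " + plainCount());    // 6
        System.out.println("distinct count: " + distinctCount()); // 4
    }
}
```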
If the USER_ID cardinality is indeed over 2 billion, maybe you need to create one cube every 21 days and combine them into one hybrid cube? I'm not sure whether that works; you can check http://kylin.apache.org/blog/2015/09/25/hybrid-model/ and have a try.

> On Aug 25, 2016, at 12:56, lxw <lxw1...@qq.com> wrote:
>
> Thanks, I got it.
>
> We have 100 million new USER_IDs per day (segment). Does that mean that after
> 21 days the building task will fail?
> And we can't use "Precisely Count Distinct" in our scenario?
>
>
> ------------------ Original ------------------
> From: "Yerui Sun" <sunye...@gmail.com>;
> Date: Thu, Aug 25, 2016, 11:55
> To: "dev" <dev@kylin.apache.org>;
>
> Subject: Re: Precisely Count Distinct on 100 million string values column
>
>
> lxw,
> If the values exceed Integer.MAX_VALUE, an exception will be thrown during
> dictionary building.
>
> You can first disable the cube and then edit the json on the web UI. The
> action button is in the "Admins" column of the cube list table.
>
> BTW, the 255 limitation could be removed in theory; however, that would make
> the logic more complicated. You can have a try and contribute a patch if
> you're interested.
>
> Yiming,
> I will post a patch for a clearer exception message and some minor
> improvements to GlobalDictionary.
> But maybe later, it's quite a busy week...
>
>> On Aug 25, 2016, at 10:05, lxw <lxw1...@qq.com> wrote:
>>
>> Sorry,
>>
>> About question 1,
>> I mean: if the count of distinct values of the column across all segments
>> exceeds Integer.MAX_VALUE, what will happen?
>>
>>
>> ------------------ Original ------------------
>> From: "lxw" <lxw1...@qq.com>;
>> Date: Thu, Aug 25, 2016, 10:01
>> To: "dev" <dev@kylin.apache.org>;
>>
>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>
>>
>> I have 2 more questions:
>>
>> 1. Is the capacity of the global dictionary Integer.MAX_VALUE? If the
>> distinct values of the column across all segments exceed that, what will
>> happen? Duplication or an error?
>>
>> 2. Where can I manually edit a cube desc json? Currently I use the Java API
>> to create or update cubes.
>>
>> Thanks!
>>
>>
>> ------------------ Original ------------------
>> From: "Yiming Liu" <liuyiming....@gmail.com>;
>> Date: Thu, Aug 25, 2016, 9:41
>> To: "dev" <dev@kylin.apache.org>; "sunyerui" <sunye...@gmail.com>;
>>
>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>
>>
>> Good find.
>>
>> The code, AppendTrieDictionary line 604:
>>
>>     // nValueBytes
>>     if (n.part.length > 255)
>>         throw new RuntimeException();
>>
>> Hi Yerui,
>>
>> Could you add more comments for the 255 limit, with a more meaningful
>> exception?
>>
>>
>> 2016-08-24 20:44 GMT+08:00 lxw <lxw1...@qq.com>:
>>
>>> It was caused by length(USER_ID) > 255.
>>> After excluding this dirty data, it works.
>>>
>>> With 150 million records in total, executing this query:
>>>
>>>     select city_code,
>>>            sum(bid_request) as bid_request,
>>>            count(distinct user_id) as uv
>>>     from liuxiaowen.TEST_T_PBS_UV_FACT
>>>     group by city_code
>>>     order by uv desc limit 100
>>>
>>> Kylin took 7 seconds and Hive took 180 seconds; the results are the same.
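For the "more meaningful exception" Yiming asks for above, one possible shape is sketched below. This is not Kylin's actual patch: the class name, constant name, and message text are invented, and the method merely stands in for the check inside AppendTrieDictionary's node serialization.

```java
public class NodePartCheck {
    // The serialized format stores a node part's length in a single byte,
    // hence the 255-byte cap (an assumption based on the nValueBytes comment).
    static final int MAX_PART_BYTES = 255;

    // Hypothetical replacement for the bare `throw new RuntimeException()`.
    static void checkPartLength(int partLengthBytes) {
        if (partLengthBytes > MAX_PART_BYTES) {
            throw new IllegalStateException(
                "Trie node part is " + partLengthBytes + " bytes, but the "
                + "dictionary format stores the part length in one byte (max "
                + MAX_PART_BYTES + "). A source value is likely longer than "
                + MAX_PART_BYTES + " bytes; please clean such dirty data.");
        }
    }

    public static void main(String[] args) {
        checkPartLength(42); // fine, no exception
        try {
            checkPartLength(300);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a message like this, the failed "Build Dimension Dictionary" step would point directly at over-long values instead of a bare RuntimeException.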
>>> >>> >>> >>> ------------------ Original ------------------ >>> From: "lxw";<lxw1...@qq.com>; >>> Date: Wed, Aug 24, 2016 05:27 PM >>> To: "dev"<dev@kylin.apache.org>; >>> >>> Subject: Precisely Count Distinct on 100 million string values column >>> >>> >>> >>> Hi, >>> >>> I am trying to use "Precisely Count Distinct" on 100 million string >>> values column "USER_ID", I updated the cube json : >>> "dictionaries": [ { "column": "USER_ID", "builder": >>> "org.apache.kylin.dict.GlobalDictionaryBuilder" } ], >>> >>> "override_kylin_properties": { >>> "kylin.job.mr.config.override.mapred.map.child.java.opts": >>> "-Xmx7g", "kylin.job.mr.config.override.mapreduce.map.memory.mb": >>> "7168" } when I build the cube, an error occurred on "#4 Step Name: >>> Build Dimension Dictionary", >>> the error log in "kylin.log" : >>> >>> 2016-08-24 17:27:53,282 ERROR [pool-7-thread-10] dict.CachedTreeMap:239 : >>> write value into /kylin_test1/kylin_metadata_test1/resources/GlobalDict/ >>> dict/LIUXIAOWEN.TEST_T_PBS_UV_FACT/USER_ID.tmp/cached_ >>> AQEByQXVzFd8r0YviP4x84YqUv-NcRiuCI2d exception: java.lang.RuntimeException >>> java.lang.RuntimeException >>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode. >>> build_writeNode(AppendTrieDictionary.java:605) >>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode. >>> buildTrieBytes(AppendTrieDictionary.java:576) >>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode. >>> write(AppendTrieDictionary.java:523) >>> at org.apache.kylin.dict.CachedTreeMap.writeValue( >>> CachedTreeMap.java:234) >>> at org.apache.kylin.dict.CachedTreeMap.write( >>> CachedTreeMap.java:374) >>> at org.apache.kylin.dict.AppendTrieDictionary.flushIndex( >>> AppendTrieDictionary.java:1043) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. 
>>> build(AppendTrieDictionary.java:954) >>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build( >>> GlobalDictionaryBuilder.java:82) >>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary( >>> DictionaryGenerator.java:81) >>> at org.apache.kylin.dict.DictionaryManager.buildDictionary( >>> DictionaryManager.java:323) >>> at org.apache.kylin.cube.CubeManager.buildDictionary( >>> CubeManager.java:185) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:51) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:42) >>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run( >>> CreateDictionaryJob.java:56) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable. >>> doWork(HadoopShellExecutable.java:63) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork( >>> DefaultChainedExecutable.java:57) >>> at org.apache.kylin.job.execution.AbstractExecutable. 
>>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$ >>> JobRunner.run(DefaultScheduler.java:127) >>> at java.util.concurrent.ThreadPoolExecutor.runWorker( >>> ThreadPoolExecutor.java:1145) >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >>> ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:744) >>> 2016-08-24 17:27:53,340 ERROR [pool-7-thread-10] >>> common.HadoopShellExecutable:65 : error execute HadoopShellExecutable{id= >>> 3a0f2751-dd2a-4a3b-a27a-58bfc0edbbfd-03, name=Build Dimension Dictionary, >>> state=RUNNING} >>> java.lang.RuntimeException >>> at org.apache.kylin.dict.CachedTreeMap.writeValue( >>> CachedTreeMap.java:240) >>> at org.apache.kylin.dict.CachedTreeMap.write( >>> CachedTreeMap.java:374) >>> at org.apache.kylin.dict.AppendTrieDictionary.flushIndex( >>> AppendTrieDictionary.java:1043) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. >>> build(AppendTrieDictionary.java:954) >>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build( >>> GlobalDictionaryBuilder.java:82) >>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary( >>> DictionaryGenerator.java:81) >>> at org.apache.kylin.dict.DictionaryManager.buildDictionary( >>> DictionaryManager.java:323) >>> at org.apache.kylin.cube.CubeManager.buildDictionary( >>> CubeManager.java:185) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:51) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:42) >>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run( >>> CreateDictionaryJob.java:56) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable. >>> doWork(HadoopShellExecutable.java:63) >>> at org.apache.kylin.job.execution.AbstractExecutable. 
>>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork( >>> DefaultChainedExecutable.java:57) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$ >>> JobRunner.run(DefaultScheduler.java:127) >>> at java.util.concurrent.ThreadPoolExecutor.runWorker( >>> ThreadPoolExecutor.java:1145) >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >>> ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:744) >>> >>> and the error log in "kylin.out" : >>> >>> Aug 24, 2016 5:25:32 PM com.google.common.cache.LocalCache >>> processPendingNotifications >>> WARNING: Exception thrown by removal listener >>> java.lang.RuntimeException >>> at org.apache.kylin.dict.CachedTreeMap.writeValue( >>> CachedTreeMap.java:240) >>> at org.apache.kylin.dict.CachedTreeMap.access$300( >>> CachedTreeMap.java:52) >>> at org.apache.kylin.dict.CachedTreeMap$1.onRemoval( >>> CachedTreeMap.java:149) >>> at com.google.common.cache.LocalCache.processPendingNotifications( >>> LocalCache.java:2011) >>> at com.google.common.cache.LocalCache$Segment. >>> runUnlockedCleanup(LocalCache.java:3501) >>> at com.google.common.cache.LocalCache$Segment. >>> postWriteCleanup(LocalCache.java:3477) >>> at com.google.common.cache.LocalCache$Segment.put( >>> LocalCache.java:2940) >>> at com.google.common.cache.LocalCache.put(LocalCache.java:4202) >>> at com.google.common.cache.LocalCache$LocalManualCache. >>> put(LocalCache.java:4798) >>> at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:284) >>> at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:52) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. >>> addValue(AppendTrieDictionary.java:829) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. 
>>> addValue(AppendTrieDictionary.java:804) >>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build( >>> GlobalDictionaryBuilder.java:78) >>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary( >>> DictionaryGenerator.java:81) >>> at org.apache.kylin.dict.DictionaryManager.buildDictionary( >>> DictionaryManager.java:323) >>> at org.apache.kylin.cube.CubeManager.buildDictionary( >>> CubeManager.java:185) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:51) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:42) >>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run( >>> CreateDictionaryJob.java:56) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable. >>> doWork(HadoopShellExecutable.java:63) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork( >>> DefaultChainedExecutable.java:57) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$ >>> JobRunner.run(DefaultScheduler.java:127) >>> at java.util.concurrent.ThreadPoolExecutor.runWorker( >>> ThreadPoolExecutor.java:1145) >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >>> ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:744) >>> >>> usage: CreateDictionaryJob >>> -cubename <cubename> Cube name. For exmaple, flat_item_cube >>> -input <input> Input path >>> -segmentname <segmentname> Cube segment name >>> >> >> >> >> -- >> With Warm regards >> >> Yiming Liu (??????)
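As a footnote to lxw's fix (excluding USER_IDs longer than 255 before the build), the same pre-check can be expressed as a small filter over the source values. This is only an illustrative sketch; the class, the method name, and the UTF-8 byte-length assumption are mine, not Kylin's.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OverlongValueFilter {
    // The AppendTrieDictionary node-part limit seen in the thread above.
    static final int MAX_BYTES = 255;

    // Keep only values whose UTF-8 byte length fits the 255-byte limit.
    static List<String> dropOverlong(List<String> values) {
        return values.stream()
                .filter(v -> v.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A dirty 300-character id alongside two normal ones.
        String dirty = new String(new char[300]).replace('\0', 'x');
        System.out.println(dropOverlong(Arrays.asList("u1", dirty, "u2")));
        // prints [u1, u2]
    }
}
```

In practice the equivalent `where length(user_id) <= 255` filter in the Hive flat-table query achieves the same cleanup without extra code.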