That depends on your USER_ID cardinality. I think your USER_ID has duplicated values between segments; that's why you use count **distinct**. If the USER_ID is always different and shows up only once, a plain count should be fine, no need for count **distinct**.
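To make the distinction concrete, here is a minimal Java sketch with toy data (the segment contents are invented for illustration): summing per-segment row counts double-counts a USER_ID that appears in more than one segment, while a distinct count over the union does not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctVsCount {
    // Two hypothetical daily segments; "u2" and "u3" appear in both.
    static final List<String> SEG1 = Arrays.asList("u1", "u2", "u3");
    static final List<String> SEG2 = Arrays.asList("u2", "u3", "u4");

    // Plain count: per-segment sizes summed; overlapping users counted twice.
    static int plainCount() {
        return SEG1.size() + SEG2.size();
    }

    // Count distinct: size of the union of USER_IDs across segments.
    static int distinctCount() {
        Set<String> union = new HashSet<>(SEG1);
        union.addAll(SEG2);
        return union.size();
    }

    public static void main(String[] args) {
        System.out.println("plain count:    " + plainCount());    // 6
        System.out.println("distinct count: " + distinctCount()); // 4
    }
}
```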
If the USER_ID cardinality is indeed over 2 billion, maybe you need to create one cube every 21 days and combine them into one hybrid cube? I'm not sure whether that works; you can check http://kylin.apache.org/blog/2015/09/25/hybrid-model/ and have a try.

> On Aug 25, 2016, at 12:56, lxw <lxw1...@qq.com> wrote:
>
> Thanks, I got it.
>
> We have 100 million new USER_IDs per day (segment). Does that mean that after
> 21 days the building task will fail?
> And we can't use "Precisely Count Distinct" in our scenario?
>
>
> ------------------ Original ------------------
> From: "Yerui Sun" <sunye...@gmail.com>;
> Date: Thu, Aug 25, 2016, 11:55
> To: "dev" <dev@kylin.apache.org>;
>
> Subject: Re: Precisely Count Distinct on 100 million string values column
>
>
> lxw,
> If the values exceed Integer.MAX_VALUE, an exception will be thrown during
> dictionary building.
>
> You can first disable the cube and then edit the json on the web UI. The
> action button is in the "Admins" column of the cube list table.
>
> BTW, the 255 limitation could be removed in theory; however, that would make
> the logic more complicated. You can have a try and contribute a patch if
> you're interested.
>
> Yiming,
> I will post a patch for a clearer exception message and some minor
> improvements to GlobalDictionary.
> But maybe later, it's quite a busy week...
>
>> On Aug 25, 2016, at 10:05, lxw <lxw1...@qq.com> wrote:
>>
>> Sorry,
>>
>> About question 1,
>> I mean: if the count of distinct values of the column across all segments
>> exceeds Integer.MAX_VALUE, what will happen?
>>
>>
>> ------------------ Original ------------------
>> From: "lxw" <lxw1...@qq.com>;
>> Date: Thu, Aug 25, 2016, 10:01
>> To: "dev" <dev@kylin.apache.org>;
>>
>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>
>>
>> I have 2 more questions:
>>
>> 1. Is the capacity of the global dictionary Integer.MAX_VALUE? If the
>> distinct values of the column across all segments exceed that, what will
>> happen? Duplication or an error?
>>
>> 2. Where can I manually edit a cube desc json? Currently I use the Java API
>> to create or update cubes.
>>
>> Thanks!
>>
>>
>> ------------------ Original ------------------
>> From: "Yiming Liu" <liuyiming....@gmail.com>;
>> Date: Thu, Aug 25, 2016, 9:41
>> To: "dev" <dev@kylin.apache.org>; "sunyerui" <sunye...@gmail.com>;
>>
>> Subject: Re: Precisely Count Distinct on 100 million string values column
>>
>>
>> Good find.
>>
>> The code, AppendTrieDictionary line 604:
>>
>>     // nValueBytes
>>     if (n.part.length > 255)
>>         throw new RuntimeException();
>>
>> Hi Yerui,
>>
>> Could you add more comments for the 255 limit, with a more meaningful
>> exception?
>>
>>
>> 2016-08-24 20:44 GMT+08:00 lxw <lxw1...@qq.com>:
>>
>>> It was caused by length(USER_ID) > 255.
>>> After excluding this dirty data, it works.
>>>
>>> With 150 million records in total, executing this query:
>>>
>>>     select city_code,
>>>            sum(bid_request) as bid_request,
>>>            count(distinct user_id) as uv
>>>     from liuxiaowen.TEST_T_PBS_UV_FACT
>>>     group by city_code
>>>     order by uv desc limit 100
>>>
>>> Kylin took 7 seconds and Hive took 180 seconds; the results are the same.
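For the "more meaningful exception" Yiming asks for above, one possible shape is sketched below. This is not Kylin's actual patch: the class name, constant name, and message text are invented, and the method merely stands in for the check inside AppendTrieDictionary's node serialization.

```java
public class NodePartCheck {
    // The serialized format stores a node part's length in a single byte,
    // hence the 255-byte cap (an assumption based on the nValueBytes comment).
    static final int MAX_PART_BYTES = 255;

    // Hypothetical replacement for the bare `throw new RuntimeException()`.
    static void checkPartLength(int partLengthBytes) {
        if (partLengthBytes > MAX_PART_BYTES) {
            throw new IllegalStateException(
                "Trie node part is " + partLengthBytes + " bytes, but the "
                + "dictionary format stores the part length in one byte (max "
                + MAX_PART_BYTES + "). A source value is likely longer than "
                + MAX_PART_BYTES + " bytes; please clean such dirty data.");
        }
    }

    public static void main(String[] args) {
        checkPartLength(42); // fine, no exception
        try {
            checkPartLength(300);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a message like this, the failed "Build Dimension Dictionary" step would point directly at over-long values instead of a bare RuntimeException.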
>>> >>> >>> >>> ------------------ Original ------------------ >>> From: "lxw";<lxw1...@qq.com>; >>> Date: Wed, Aug 24, 2016 05:27 PM >>> To: "dev"<dev@kylin.apache.org>; >>> >>> Subject: Precisely Count Distinct on 100 million string values column >>> >>> >>> >>> Hi, >>> >>> I am trying to use "Precisely Count Distinct" on 100 million string >>> values column "USER_ID", I updated the cube json : >>> "dictionaries": [ { "column": "USER_ID", "builder": >>> "org.apache.kylin.dict.GlobalDictionaryBuilder" } ], >>> >>> "override_kylin_properties": { >>> "kylin.job.mr.config.override.mapred.map.child.java.opts": >>> "-Xmx7g", "kylin.job.mr.config.override.mapreduce.map.memory.mb": >>> "7168" } when I build the cube, an error occurred on "#4 Step Name: >>> Build Dimension Dictionary", >>> the error log in "kylin.log" : >>> >>> 2016-08-24 17:27:53,282 ERROR [pool-7-thread-10] dict.CachedTreeMap:239 : >>> write value into /kylin_test1/kylin_metadata_test1/resources/GlobalDict/ >>> dict/LIUXIAOWEN.TEST_T_PBS_UV_FACT/USER_ID.tmp/cached_ >>> AQEByQXVzFd8r0YviP4x84YqUv-NcRiuCI2d exception: java.lang.RuntimeException >>> java.lang.RuntimeException >>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode. >>> build_writeNode(AppendTrieDictionary.java:605) >>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode. >>> buildTrieBytes(AppendTrieDictionary.java:576) >>> at org.apache.kylin.dict.AppendTrieDictionary$DictNode. >>> write(AppendTrieDictionary.java:523) >>> at org.apache.kylin.dict.CachedTreeMap.writeValue( >>> CachedTreeMap.java:234) >>> at org.apache.kylin.dict.CachedTreeMap.write( >>> CachedTreeMap.java:374) >>> at org.apache.kylin.dict.AppendTrieDictionary.flushIndex( >>> AppendTrieDictionary.java:1043) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. 
>>> build(AppendTrieDictionary.java:954) >>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build( >>> GlobalDictionaryBuilder.java:82) >>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary( >>> DictionaryGenerator.java:81) >>> at org.apache.kylin.dict.DictionaryManager.buildDictionary( >>> DictionaryManager.java:323) >>> at org.apache.kylin.cube.CubeManager.buildDictionary( >>> CubeManager.java:185) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:51) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:42) >>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run( >>> CreateDictionaryJob.java:56) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable. >>> doWork(HadoopShellExecutable.java:63) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork( >>> DefaultChainedExecutable.java:57) >>> at org.apache.kylin.job.execution.AbstractExecutable. 
>>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$ >>> JobRunner.run(DefaultScheduler.java:127) >>> at java.util.concurrent.ThreadPoolExecutor.runWorker( >>> ThreadPoolExecutor.java:1145) >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >>> ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:744) >>> 2016-08-24 17:27:53,340 ERROR [pool-7-thread-10] >>> common.HadoopShellExecutable:65 : error execute HadoopShellExecutable{id= >>> 3a0f2751-dd2a-4a3b-a27a-58bfc0edbbfd-03, name=Build Dimension Dictionary, >>> state=RUNNING} >>> java.lang.RuntimeException >>> at org.apache.kylin.dict.CachedTreeMap.writeValue( >>> CachedTreeMap.java:240) >>> at org.apache.kylin.dict.CachedTreeMap.write( >>> CachedTreeMap.java:374) >>> at org.apache.kylin.dict.AppendTrieDictionary.flushIndex( >>> AppendTrieDictionary.java:1043) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. >>> build(AppendTrieDictionary.java:954) >>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build( >>> GlobalDictionaryBuilder.java:82) >>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary( >>> DictionaryGenerator.java:81) >>> at org.apache.kylin.dict.DictionaryManager.buildDictionary( >>> DictionaryManager.java:323) >>> at org.apache.kylin.cube.CubeManager.buildDictionary( >>> CubeManager.java:185) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:51) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:42) >>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run( >>> CreateDictionaryJob.java:56) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable. >>> doWork(HadoopShellExecutable.java:63) >>> at org.apache.kylin.job.execution.AbstractExecutable. 
>>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork( >>> DefaultChainedExecutable.java:57) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$ >>> JobRunner.run(DefaultScheduler.java:127) >>> at java.util.concurrent.ThreadPoolExecutor.runWorker( >>> ThreadPoolExecutor.java:1145) >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >>> ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:744) >>> >>> and the error log in "kylin.out" : >>> >>> Aug 24, 2016 5:25:32 PM com.google.common.cache.LocalCache >>> processPendingNotifications >>> WARNING: Exception thrown by removal listener >>> java.lang.RuntimeException >>> at org.apache.kylin.dict.CachedTreeMap.writeValue( >>> CachedTreeMap.java:240) >>> at org.apache.kylin.dict.CachedTreeMap.access$300( >>> CachedTreeMap.java:52) >>> at org.apache.kylin.dict.CachedTreeMap$1.onRemoval( >>> CachedTreeMap.java:149) >>> at com.google.common.cache.LocalCache.processPendingNotifications( >>> LocalCache.java:2011) >>> at com.google.common.cache.LocalCache$Segment. >>> runUnlockedCleanup(LocalCache.java:3501) >>> at com.google.common.cache.LocalCache$Segment. >>> postWriteCleanup(LocalCache.java:3477) >>> at com.google.common.cache.LocalCache$Segment.put( >>> LocalCache.java:2940) >>> at com.google.common.cache.LocalCache.put(LocalCache.java:4202) >>> at com.google.common.cache.LocalCache$LocalManualCache. >>> put(LocalCache.java:4798) >>> at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:284) >>> at org.apache.kylin.dict.CachedTreeMap.put(CachedTreeMap.java:52) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. >>> addValue(AppendTrieDictionary.java:829) >>> at org.apache.kylin.dict.AppendTrieDictionary$Builder. 
>>> addValue(AppendTrieDictionary.java:804) >>> at org.apache.kylin.dict.GlobalDictionaryBuilder.build( >>> GlobalDictionaryBuilder.java:78) >>> at org.apache.kylin.dict.DictionaryGenerator.buildDictionary( >>> DictionaryGenerator.java:81) >>> at org.apache.kylin.dict.DictionaryManager.buildDictionary( >>> DictionaryManager.java:323) >>> at org.apache.kylin.cube.CubeManager.buildDictionary( >>> CubeManager.java:185) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:51) >>> at org.apache.kylin.cube.cli.DictionaryGeneratorCLI. >>> processSegment(DictionaryGeneratorCLI.java:42) >>> at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run( >>> CreateDictionaryJob.java:56) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) >>> at org.apache.kylin.engine.mr.common.HadoopShellExecutable. >>> doWork(HadoopShellExecutable.java:63) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork( >>> DefaultChainedExecutable.java:57) >>> at org.apache.kylin.job.execution.AbstractExecutable. >>> execute(AbstractExecutable.java:112) >>> at org.apache.kylin.job.impl.threadpool.DefaultScheduler$ >>> JobRunner.run(DefaultScheduler.java:127) >>> at java.util.concurrent.ThreadPoolExecutor.runWorker( >>> ThreadPoolExecutor.java:1145) >>> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >>> ThreadPoolExecutor.java:615) >>> at java.lang.Thread.run(Thread.java:744) >>> >>> usage: CreateDictionaryJob >>> -cubename <cubename> Cube name. For exmaple, flat_item_cube >>> -input <input> Input path >>> -segmentname <segmentname> Cube segment name >>> >> >> >> >> -- >> With Warm regards >> >> Yiming Liu (??????)
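As a footnote to lxw's fix (excluding USER_IDs longer than 255 before the build), the same pre-check can be expressed as a small filter over the source values. This is only an illustrative sketch; the class, the method name, and the UTF-8 byte-length assumption are mine, not Kylin's.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class OverlongValueFilter {
    // The AppendTrieDictionary node-part limit seen in the thread above.
    static final int MAX_BYTES = 255;

    // Keep only values whose UTF-8 byte length fits the 255-byte limit.
    static List<String> dropOverlong(List<String> values) {
        return values.stream()
                .filter(v -> v.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A dirty 300-character id alongside two normal ones.
        String dirty = new String(new char[300]).replace('\0', 'x');
        System.out.println(dropOverlong(Arrays.asList("u1", dirty, "u2")));
        // prints [u1, u2]
    }
}
```

In practice the equivalent `where length(user_id) <= 255` filter in the Hive flat-table query achieves the same cleanup without extra code.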