[ 
https://issues.apache.org/jira/browse/KYLIN-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Calaba updated KYLIN-1835:
----------------------------------
    Description: 
I believe I have discovered an error in Kylin related to count_distinct with 
exact precision.

I am not 100% sure, but everything points to a design limit for 
count_distinct. Please assess and confirm or reject my observation.

Background info:
=============
- large fact table, ~100 million rows
- large customer dimension, ~10 million rows

Defined 2 KPIs of type COUNT_DISTINCT with exact precision (return type 
bitmap) on 2 high-cardinality fields of type Bigint. One measure is expected 
to have at most 15,000,000 distinct values; the second can have more, 
approximately 50 million (just an estimate).

Error info:
========

The cube build runs fine until step #7 (Step Name: Build Base Cuboid Data), 
where it fails without further details in the Kylin log; it shows only 
"no counters for job job_1463699962519_16085".

The MR logs of job job_1463699962519_16085 show exceptions:

2016-06-28 02:22:24,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NumberFormatException: For input string: "-6628245177096591402"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:495)
        at java.lang.Integer.parseInt(Integer.java:527)
        at org.apache.kylin.measure.bitmap.BitmapCounter.add(BitmapCounter.java:63)
        at org.apache.kylin.measure.bitmap.BitmapMeasureType$1.valueOf(BitmapMeasureType.java:106)
        at org.apache.kylin.measure.bitmap.BitmapMeasureType$1.valueOf(BitmapMeasureType.java:98)
        at org.apache.kylin.engine.mr.steps.BaseCuboidMapperBase.buildValueOf(BaseCuboidMapperBase.java:189)
        at org.apache.kylin.engine.mr.steps.BaseCuboidMapperBase.buildValue(BaseCuboidMapperBase.java:159)
        at org.apache.kylin.engine.mr.steps.BaseCuboidMapperBase.outputKV(BaseCuboidMapperBase.java:206)
        at org.apache.kylin.engine.mr.steps.HiveToBaseCuboidMapper.map(HiveToBaseCuboidMapper.java:53)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Reading the exception signature together with the measure's return type 
"bitmap", it looks like the exception occurs because I chose exact precision 
(which the UI says is supported for int types) while passing a Bigint field: 
the trace shows BitmapCounter.add calling Integer.parseInt on a value that is 
outside the 32-bit int range.
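
The suspected cause can be reproduced outside Kylin. The following is a minimal sketch, not Kylin's code: the class and method names are made up for illustration, and only the input value is taken from the stack trace above. It shows that Integer.parseInt rejects the value while Long.parseLong accepts it.

```java
// Illustration only: IntRangeCheck and its methods are hypothetical names,
// not part of Kylin. The sample value comes from the stack trace.
final class IntRangeCheck {

    // True when the decimal string parses as a 32-bit signed int.
    public static boolean fitsInInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // True when the decimal string parses as a 64-bit signed long.
    public static boolean fitsInLong(String s) {
        try {
            Long.parseLong(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "-6628245177096591402"; // value reported in the MR log
        System.out.println("fits in int:  " + fitsInInt(value));
        System.out.println("fits in long: " + fitsInLong(value));
    }
}
```

If this is indeed what happens in BitmapCounter.add, any Bigint value outside the int range would fail the same way, regardless of how many distinct values the column holds.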

If so, is this a bug (refactoring for bigint needed) or a design limitation? 
Can count_distinct not be implemented for bigint with exact precision, or do 
I have to use count_distinct with an error rate instead?

If I do not need to calculate count_distinct for all dimension combinations, 
I might add some mandatory dimensions to the aggregation group, but I am not 
sure whether this would resolve the issue (assuming I keep the 
exact-precision counts).


  was:
I believe I have discovered an error in Kylin realted to count_distinc with 
exact precission.

I am not 100% sure - but all points to the fact tha there is a design limit dor 
count_distinct ... please assess / confirm / reject my observation.

Background info:
=============
- large fact table ~ 100 mio rows.
- large customer dimension ~ 10 mio rows

Defined 2 KPIs of type COUNT_DISTINCT - with exact precision (return type 
bitma) on 2 high-cardinality fields of type Bigint

Cube Build runs fine till #7 Step Name: Build Base Cuboid Data - where it 
errors out without further details in Kylin Log - it shows only "no counters 
for job job_1463699962519_16085".

The MR Logs of the job job_1463699962519_16085 sow exceptions:

2016-06-28 02:22:24,019 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NumberFormatException: For input string: "-6628245177096591402"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:495)
        at java.lang.Integer.parseInt(Integer.java:527)
        at org.apache.kylin.measure.bitmap.BitmapCounter.add(BitmapCounter.java:63)
        at org.apache.kylin.measure.bitmap.BitmapMeasureType$1.valueOf(BitmapMeasureType.java:106)
        at org.apache.kylin.measure.bitmap.BitmapMeasureType$1.valueOf(BitmapMeasureType.java:98)
        at org.apache.kylin.engine.mr.steps.BaseCuboidMapperBase.buildValueOf(BaseCuboidMapperBase.java:189)
        at org.apache.kylin.engine.mr.steps.BaseCuboidMapperBase.buildValue(BaseCuboidMapperBase.java:159)
        at org.apache.kylin.engine.mr.steps.BaseCuboidMapperBase.outputKV(BaseCuboidMapperBase.java:206)
        at org.apache.kylin.engine.mr.steps.HiveToBaseCuboidMapper.map(HiveToBaseCuboidMapper.java:53)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:773)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:345)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1566)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

Just reading the signature of the exception and connecting the Measure 
precision return type "bitmap" => looks like that because I have chosen exact 
precision (which on UI says supported for int types) is causing this exception 
because I am passing Bigint field ???? 


If so -> is that a bug or design limitation ??? Cannot be the count_distinct 
implemented for bigint (with exact precision) or do I have to use 
count_distinct with error rate instead ???




> Error: java.lang.NumberFormatException: For input count_distinct on Big Int 
> ??? (#7 Step Name: Build Base Cuboid Data)
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: KYLIN-1835
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1835
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: v1.5.2, v1.5.2.1
>            Reporter: Richard Calaba
>            Priority: Critical



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
