[
https://issues.apache.org/jira/browse/KYLIN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591116#comment-16591116
]
Shaofeng SHI commented on KYLIN-3491:
-------------------------------------
Yanghong, this patch passes IT, but it did not take effect in any CI cube.
I manually enabled this feature in the "ci_left_join_cube" cube with the property:
{code:java}
"kylin.dictionary.shrunken-from-global-enabled": "true"
{code}
CI then failed at the "Kylin_Extract_Dictionary_from_Global_ci_left_join_cube_Step"
step; the map task reported this error:
{code:java}
Error: java.lang.NullPointerException
    at org.apache.kylin.dict.ShrunkenDictionaryBuilder.build(ShrunkenDictionaryBuilder.java:46)
    at org.apache.kylin.engine.mr.steps.ExtractDictionaryFromGlobalMapper.doCleanup(ExtractDictionaryFromGlobalMapper.java:130)
    at org.apache.kylin.engine.mr.KylinMapper.cleanup(KylinMapper.java:103)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:149)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
{code}
> Improve the cube building process when using global dictionary
> --------------------------------------------------------------
>
> Key: KYLIN-3491
> URL: https://issues.apache.org/jira/browse/KYLIN-3491
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine
> Reporter: Zhong Yanghong
> Assignee: Zhong Yanghong
> Priority: Major
> Fix For: v2.5.0
>
> Attachments: APACHE-KYLIN-3491.patch
>
>
> In the current cubing process, if the global dictionary is very large, it is
> hard to encode raw values into ids for the bitmap input: the raw data records
> are unsorted, so the dictionary slices are swapped frequently. We need a
> refined process. The idea is as follows:
> # For each source data block, a mapper generates the distinct values and
> sorts them.
> # Encode the sorted distinct values and generate a shrunken dict for each
> source data block.
> # When building the base cuboid, use the shrunken dict of each source data
> block for encoding.
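
The three steps in the description above can be sketched roughly as follows. This is
only an illustration of the idea, not the code in the attached patch: the class
ShrunkenDictSketch, the GlobalDict interface and both method names are hypothetical,
the global dictionary is simulated with an in-memory map, and the block is assumed
to contain no null values and only values that exist in the global dictionary.
{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class ShrunkenDictSketch {

    // Hypothetical stand-in for a (large, sliced) global dictionary lookup.
    interface GlobalDict {
        int idOf(String value);
    }

    // Steps 1 and 2: collect the sorted distinct values of one source data
    // block, then resolve each value against the global dictionary once.
    // Visiting the values in sorted order is what is meant to limit slice
    // swapping in a real global dictionary.
    static Map<String, Integer> buildShrunkenDict(List<String> block, GlobalDict global) {
        TreeSet<String> sortedDistinct = new TreeSet<>(block);
        Map<String, Integer> shrunken = new HashMap<>();
        for (String value : sortedDistinct) {
            shrunken.put(value, global.idOf(value));
        }
        return shrunken;
    }

    // Step 3: encode the rows of the block with the small per-block dict only.
    static int[] encode(List<String> block, Map<String, Integer> shrunken) {
        int[] ids = new int[block.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = shrunken.get(block.get(i));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Simulated global dictionary; the ids are arbitrary sample values.
        Map<String, Integer> backing = new HashMap<>();
        backing.put("apple", 10);
        backing.put("banana", 20);
        backing.put("cherry", 30);
        backing.put("date", 40);
        GlobalDict global = backing::get;

        List<String> block = Arrays.asList("cherry", "apple", "cherry", "banana", "apple");
        Map<String, Integer> shrunken = buildShrunkenDict(block, global);

        System.out.println(shrunken.size());                            // prints 3
        System.out.println(Arrays.toString(encode(block, shrunken)));   // prints [30, 10, 30, 20, 10]
    }
}
{code}
In this sketch the global dictionary is consulted once per distinct value, in sorted
order, and the row-level encoding touches only the per-block map.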