[
https://issues.apache.org/jira/browse/KYLIN-4342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018765#comment-17018765
]
Xiaoxiang Yu commented on KYLIN-4342:
-------------------------------------
Great feature for Apache Kylin!
> Build Global Dict by MR/Hive New Version
> ----------------------------------------
>
> Key: KYLIN-4342
> URL: https://issues.apache.org/jira/browse/KYLIN-4342
> Project: Kylin
> Issue Type: Improvement
> Affects Versions: Future
> Reporter: wangxiaojing
> Assignee: wangxiaojing
> Priority: Major
>
> At present, the MR/Hive implementation of the global dictionary has two
> limitations and some distributed concurrency lock bugs:
> 1. Because the Hive ORDER BY global sort runs in the shuffle stage, memory
> use and build time become uncontrollable once the data volume reaches the
> billion level. In our test with a cardinality of about 800 million and 15 GB
> of memory configured, the dictionary build step took more than 10 hours;
> 2. Multiple global dictionary columns are calculated serially.
> 3. Some distributed concurrency lock bugs.
> We have improved the original version. The general idea of the new version is
> the same as the previous MR/Hive implementation: complete the global
> dictionary encoding through Hive or MR, and then replace the original
> values in the flat table with the dictionary-encoded values. See MR/Hive V1:
> http://kylin.apache.org/docs30/howto/howto_use_hive_mr_dict.html
> However, the new version adds two MR steps, "parallel part build" and
> "parallel total build", to replace the original "build dict" step and so
> solve the above two limitations. It also uses ZK to solve the distributed
> concurrency lock bugs.
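
For readers unfamiliar with the idea, here is a minimal single-process sketch of how a two-phase dictionary build can avoid a global sort. It is not the KYLIN-4342 implementation: the issue does not describe what "parallel part build" and "parallel total build" do internally, so the function names, the hash partitioning, and the offset scheme below are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import accumulate
from zlib import crc32


def partition_of(value: str, num_partitions: int) -> int:
    # Stable hash, so a given value always lands in the same partition.
    return crc32(value.encode("utf-8")) % num_partitions


def part_build(values, num_partitions):
    """Phase 1 (hypothetical "part build"): distinct values per partition."""
    parts = defaultdict(set)
    for v in values:
        parts[partition_of(v, num_partitions)].add(v)
    # Sorting happens inside each partition only; there is no global sort.
    return [sorted(parts[p]) for p in range(num_partitions)]


def total_build(parts):
    """Phase 2 (hypothetical "total build"): assign globally unique ids."""
    sizes = [len(p) for p in parts]
    # Exclusive prefix sum: the first id offset owned by each partition.
    offsets = [0] + list(accumulate(sizes))[:-1]
    dictionary = {}
    for part, offset in zip(parts, offsets):
        # With known offsets, each partition could run this loop independently.
        for i, v in enumerate(part):
            dictionary[v] = offset + i + 1
    return dictionary


def encode(column, dictionary):
    # Replace original values with their dictionary-encoded ids.
    return [dictionary[v] for v in column]
```

Because every distinct value belongs to exactly one partition, only the partition sizes have to be combined centrally. In a real job each partition would be handled by its own task, and an incremental build would also have to merge with ids assigned by earlier builds, which this sketch leaves out.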