I think the solution should work; let's open an issue and resolve it later: https://issues.apache.org/jira/browse/KYLIN-1323
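In short, the proposal discussed in the thread below is to calculate more split keys than regions, partition the 'convert to hfile' reducers by the full key list, and use only a sampled subset of those keys when actually creating the HTable regions. A minimal sketch of the sampling step, assuming the full split-key list has already been computed (class and method names are illustrative, not Kylin's actual code):

import java.util.ArrayList;
import java.util.List;

public class SplitKeySamplingSketch {

    // Keep every n-th split key for HTable region creation; the full list
    // would still be used to partition the "convert to hfile" reducers.
    static List<byte[]> sampleForRegions(List<byte[]> allSplitKeys, int hfilesPerRegion) {
        List<byte[]> regionSplits = new ArrayList<>();
        for (int i = hfilesPerRegion - 1; i < allSplitKeys.size(); i += hfilesPerRegion) {
            regionSplits.add(allSplitKeys.get(i));
        }
        return regionSplits;
    }

    public static void main(String[] args) {
        // 99 split keys -> 100 reducers; keeping every 10th key -> 9 keys -> 10 regions.
        List<byte[]> all = new ArrayList<>();
        for (int i = 1; i <= 99; i++) {
            all.add(new byte[] { (byte) i });
        }
        List<byte[]> regionKeys = sampleForRegions(all, 10);
        System.out.println("reducers: " + (all.size() + 1) + ", regions: " + (regionKeys.size() + 1));
    }
}

With the numbers from the thread (100GB of cuboid output at 10GB per region), this gives 100 reducers writing 100 HFiles while still creating only 10 regions, so each region bulk-loads 10 files.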
> On Jan 15, 2016, at 15:57, ShaoFeng Shi <shaofeng...@apache.org> wrote:
>
> For Meng's case, writing 5GB takes 40 minutes; that's really slow. The
> bottleneck should be the HDFS write (the cuboids have already been
> calculated; that step just converts them to HFile format, with no other
> computation).
>
> 2016-01-15 15:36 GMT+08:00 hongbin ma <mahong...@apache.org>:
>
>> If it works I'd love to see the change.
>>
>> On Fri, Jan 15, 2016 at 3:35 PM, hongbin ma <mahong...@apache.org> wrote:
>>
>>> I'm not sure if it will work; does HBase bulk load allow that?
>>>
>>> On Fri, Jan 15, 2016 at 2:28 PM, Yerui Sun <sunye...@gmail.com> wrote:
>>>
>>>> hongbin,
>>>>
>>>> I understand how the number of reducers is determined, and it could be
>>>> improved.
>>>>
>>>> Suppose we get 100GB of data after cuboid building, with a setting of
>>>> 10GB per region. For now, 10 split keys are calculated, 10 regions are
>>>> created, and 10 reducers are used in the 'convert to hfile' step.
>>>>
>>>> With the optimization, we could calculate 100 (or more) split keys and
>>>> use all of them in the 'convert to hfile' step, but sample 10 of them to
>>>> create the regions. The result is still 10 regions created, but 100
>>>> reducers used in the 'convert to hfile' step. Of course, 100 HFiles are
>>>> also created, so each region loads 10 files. That should be fine and
>>>> shouldn't affect query performance dramatically.
>>>>
>>>>> On Jan 15, 2016, at 13:53, hongbin ma <mahong...@apache.org> wrote:
>>>>>
>>>>> Hi Yerui,
>>>>>
>>>>> The reason the number of "convert to hfile" reducers is small is that
>>>>> each reducer's output will become an HTable region. Too many regions
>>>>> would be a burden on the HBase cluster. In our production environment
>>>>> we have cubes that are 10T+; guess how many regions they would populate?
>>>>>
>>>>> What's more, Kylin provides different profiles to control the expected
>>>>> region size (thus controlling the number of regions and the parallelism
>>>>> of the "convert to hfile" reducers), and you can choose one depending on
>>>>> your cube size. In 2.x it's basically 10G for small cubes, 20G for
>>>>> medium cubes, and 100G for large cubes. However, this is manual work
>>>>> when creating the cube, and I admit the value settings for the three
>>>>> profiles are still open for discussion.
>>>>>
>>>>> On Fri, Jan 15, 2016 at 11:29 AM, Yerui Sun <sunye...@gmail.com> wrote:
>>>>>
>>>>>> Agreed with 梁猛 (Liang Meng).
>>>>>>
>>>>>> Actually, we found the same issue: the number of reducers in the
>>>>>> 'convert to hfile' step is too small, since it is the same as the
>>>>>> region count.
>>>>>>
>>>>>> I think we could increase the number of reducers to improve
>>>>>> performance. If anyone is interested in this, we could discuss the
>>>>>> solution further.
>>>>>>
>>>>>>> On Jan 15, 2016, at 09:46, 13802880...@139.com wrote:
>>>>>>>
>>>>>>> Actually, I found the last step, "convert to hfile", takes too much
>>>>>>> time: more than 40 minutes for a single region (using the 'small'
>>>>>>> profile; the result file is about 5GB).
>>>>>>>
>>>>>>> China Mobile Guangdong Co., Ltd., Network Management Center, 梁猛 (Liang Meng)
>>>>>>> 13802880...@139.com
>>>>>>>
>>>>>>> From: ShaoFeng Shi
>>>>>>> Date: 2016-01-15 09:40
>>>>>>> To: dev
>>>>>>> Subject: Re: beg suggestions to speed up the Kylin cube build
>>>>>>>
>>>>>>> The cube build performance is largely determined by your Hadoop
>>>>>>> cluster's capacity. You can inspect the MR jobs' statistics to analyze
>>>>>>> the potential bottlenecks.
>>>>>>>
>>>>>>> 2016-01-15 7:19 GMT+08:00 zhong zhang <zzaco...@gmail.com>:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We are trying to build a nine-dimension cube: eight mandatory
>>>>>>>> dimensions and one hierarchy dimension. The fact table is about 20G,
>>>>>>>> and the two lookup tables are 1.3M and 357K respectively. It takes
>>>>>>>> about 3 hours to reach 30% progress, which is rather slow.
>>>>>>>>
>>>>>>>> We'd like to know whether there are suggestions to speed up the
>>>>>>>> Kylin cube build. We got a suggestion from a slide that said to sort
>>>>>>>> the dimensions by cardinality. Are there any other ways we can try?
>>>>>>>>
>>>>>>>> We also noticed that only half of the memory and half of the CPU are
>>>>>>>> used during the cube build. Are there any ways to fully utilize the
>>>>>>>> resources?
>>>>>>>>
>>>>>>>> Looking forward to hearing from you.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Zhong
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Shaofeng Shi
>>>>>
>>>>> --
>>>>> Regards,
>>>>>
>>>>> *Bin Mahone | 马洪宾*
>>>>> Apache Kylin: http://kylin.io
>>>>> Github: https://github.com/binmahone
>>>
>>> --
>>> Regards,
>>>
>>> *Bin Mahone | 马洪宾*
>>> Apache Kylin: http://kylin.io
>>> Github: https://github.com/binmahone
>>
>> --
>> Regards,
>>
>> *Bin Mahone | 马洪宾*
>> Apache Kylin: http://kylin.io
>> Github: https://github.com/binmahone
>
> --
> Best regards,
>
> Shaofeng Shi
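A side note on the small/medium/large profiles mentioned above: in Kylin releases of that era they correspond to region-cut settings in kylin.properties, with the cut size given in GB. The exact property names vary between versions, so the lines below are only illustrative; check the kylin.properties shipped with your installation:

# Illustrative values; names may differ per Kylin version.
kylin.hbase.region.cut.small=10
kylin.hbase.region.cut.medium=20
kylin.hbase.region.cut.large=100

Choosing the profile that matches the expected cube size controls how many regions, and therefore how many 'convert to hfile' reducers, are created.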