OK, I have found another problem (I am a problem maker, ^_^). Today I built a cube with 15 dimensions (one mandatory dimension, two hierarchy dimensions, and the rest normal dimensions), and the cuboid files are 1.9TB. The step of converting cuboids to HFiles is too slow; checking the log of this job, I found 9000+ mappers but only one reducer.
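My guess (I have not verified this against Kylin's source, so treat it as an assumption) is that the single reducer comes from HBase's bulk-load helper: HFileOutputFormat.configureIncrementalLoad() sets the number of reduce tasks to the number of regions in the target table, so a table created with a single region gets a single reducer. A minimal sketch of a classic bulk-load job (my own illustration, not Kylin's code; the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "cuboid-to-hfile-sketch");
        // ... mapper class, input format and key/value types omitted ...

        HTable table = new HTable(conf, "MY_CUBE_HTABLE"); // made-up table name
        // configureIncrementalLoad wires in TotalOrderPartitioner and, crucially,
        // calls job.setNumReduceTasks(<number of regions in the table>).
        // A target table with only one region therefore gives exactly one reducer,
        // no matter how many mappers the input produces.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfile-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If that is really the cause, pre-splitting the HTable into more regions (or fixing however the region split is calculated) should raise the reducer count.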
I discarded this job when our Hadoop administrator told me that the node which ran this reducer was out of disk space, so I had to stop it. I wonder why there is only one reducer (I have not checked the source code of this job). By the way, my original data is only hundreds of MB; I think this would cause even more problems if the original data were bigger or there were many more dimensions.

2015-09-10 23:46 GMT+08:00 Luke Han <[email protected]>:

> 2.0 will not come out soon; there is a huge refactoring and a bunch of new
> features, and we have to make sure there are no critical bugs before the release.
>
> The same function is also available under the v1.x branch; please stay tuned for
> updated information about that.
>
> Thanks.
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Thu, Sep 10, 2015 at 7:50 PM, yu feng <[email protected]> wrote:
>
> > What good news! I hope you can release that version as quickly as
> > possible. Today I built a cube whose cuboid files are 1.9TB; if we merge
> > cubes based on cuboid files, I think it will be very slow.
> >
> > 2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:
> >
> > > We have implemented merging directly from the HTable in Kylin 2.0, which
> > > hasn't been released/announced yet.
> > >
> > > On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
> > >
> > > > I think Kylin could finish merging by depending only on the tables in
> > > > HBase. That would make merging cubes much quicker, wouldn't it?
> > > >
> > > > 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
> > > >
> > > >> After checking the source code, I find you are right: the cuboid files are
> > > >> used while merging segments. But a new question comes up: why doesn't Kylin
> > > >> merge segments based on the HFiles? I cannot find any place where an HBase
> > > >> table is taken as the input format of a MapReduce job, although Kylin does
> > > >> take HFileOutputFormat as the output format while converting cuboids to HFiles.
> > > >>
> > > >> From this I find that Kylin actually takes more space for a cube: not
> > > >> only the HFiles but also the cuboid files. The former are used for queries
> > > >> and the latter for merging, and the cuboid files take more space than
> > > >> the HFiles.
> > > >>
> > > >> I think we could do something to optimize this... I would like to know your
> > > >> opinions about it.
> > > >>
> > > >> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> > > >>
> > > >>> Hi, yu feng,
> > > >>> I've also noticed these files and opened a JIRA:
> > > >>> https://issues.apache.org/jira/browse/KYLIN-978; I'll post a patch
> > > >>> tonight.
> > > >>>
> > > >>> Here are my opinions on your three questions; feel free to correct me:
> > > >>>
> > > >>> First, the data path of the intermediate Hive table should be deleted
> > > >>> after building; I agree with that.
> > > >>>
> > > >>> Second, the cuboid files will be used for merging and will be deleted
> > > >>> when the merge job completes; we need them and must leave them on HDFS.
> > > >>> The fact_distinct_columns path should be deleted. Additionally, the
> > > >>> rowkey_stats and hfile paths
> > > >>> should also be deleted.
> > > >>>
> > > >>> Third, there is no garbage collection step if a job is discarded; maybe we
> > > >>> need a patch for this.
> > > >>>
> > > >>>
> > > >>> Short answer:
> > > >>> KYLIN-978 will clean all HDFS paths except the cuboid files after the
> > > >>> build job and merge job complete.
> > > >>> The HDFS paths will not be cleaned up if a job is discarded; we need an
> > > >>> improvement for this.
> > > >>>
> > > >>>
> > > >>> Best Regards,
> > > >>> Yerui Sun
> > > >>> [email protected]
> > > >>>
> > > >>>
> > > >>>
> > > >>> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> > > >>> >
> > > >>> > I see this core improvement in release 1.0; JIRA URL:
> > > >>> > https://issues.apache.org/jira/browse/KYLIN-926
> > > >>> >
> > > >>> > However, after my tests and after checking the source code, I found some
> > > >>> > rubbish (I am not sure) files in HDFS.
> > > >>> >
> > > >>> > First, Kylin only drops the intermediate table in Hive, but the table
> > > >>> > is an EXTERNAL table, so the files still exist in the Kylin tmp
> > > >>> > directory in HDFS (I checked that).
> > > >>> >
> > > >>> > Second, the cuboid files take a large amount of space in HDFS, and Kylin
> > > >>> > does not delete them after the cube build (fact_distinct_columns files
> > > >>> > exist too). I am not sure whether they have other effects; please
> > > >>> > remind me if they do.
> > > >>> >
> > > >>> > Third, after I discard a job, I think Kylin should delete the
> > > >>> > intermediate files and drop the intermediate Hive table, even if it
> > > >>> > deletes them asynchronously. I think this data does not have any
> > > >>> > effect; please remind me if it does.
> > > >>> >
> > > >>> > This rubbish data still exists in the current version (kylin-1.0);
> > > >>> > please check, thanks.
> > > >>>
> > > >>
> > >
> >
>
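P.S. Regarding the question above about using an HBase table as the input of a MapReduce job: HBase itself does ship TableInputFormat, which TableMapReduceUtil can wire into a job, so reading segments straight from the HTable is at least technically possible. A minimal read-only sketch (my own illustration, not Kylin code; the table name is made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadHTableSketch {

    // Identity-style mapper: a real merge job would decode and re-encode
    // cuboid rows here instead of just re-emitting them.
    static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            context.write(rowKey, row);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "read-htable-sketch");

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner caching for batch throughput
        scan.setCacheBlocks(false);  // don't pollute the block cache with a full scan

        // "MY_SEGMENT_HTABLE" is a made-up table name.
        TableMapReduceUtil.initTableMapperJob(
                "MY_SEGMENT_HTABLE", scan, RowMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

        job.setNumReduceTasks(0);                          // map-only for this sketch
        job.setOutputFormatClass(NullOutputFormat.class);  // discard output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Whether this would actually be faster than reading the cuboid files is a separate question: a full scan through HBase is usually slower per byte than a sequential HDFS read, which may be part of why merging is done from the cuboid files today.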

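P.S. 2: For the garbage collection discussion, the cleanup step itself would just be recursive HDFS deletes of the per-job paths Yerui listed. A minimal sketch, assuming a made-up per-job working directory layout (real Kylin paths differ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical working directory for one build job; placeholder path.
        String jobWorkingDir = "/kylin/kylin_metadata/kylin-<job-id>";
        // The subdirectories named in this thread as safe to remove after a
        // build completes (cuboid files are kept for later merges).
        for (String sub : new String[] {"fact_distinct_columns", "rowkey_stats", "hfile"}) {
            Path p = new Path(jobWorkingDir, sub);
            if (fs.exists(p)) {
                boolean deleted = fs.delete(p, true); // recursive delete
                System.out.println((deleted ? "deleted " : "failed to delete ") + p);
            }
        }
    }
}

Running the same kind of sweep when a job is discarded (plus dropping the intermediate Hive table and its external data path) would cover the third point as well.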