After checking the source code, I find you are right: the cuboid files are used when merging segments. But this raises a new question: why can't Kylin merge segments based on the HFiles alone? I cannot find how to use an HBase table as the input format of a MapReduce job, even though Kylin uses HFileOutputFormat as the output format when converting cuboids to HFiles.
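To make the file lifecycle concrete, here is a rough sketch of the cleanup rules described in the quoted message below. It is only my reading of this thread: the artifact labels and function names are made up for illustration and are not Kylin's actual classes or HDFS layout, and the "discarded" branch is the proposed behaviour, not what kylin-1.0 does.

```python
# Sketch only: labels and function names are illustrative assumptions,
# not Kylin's actual code or HDFS paths.

# Per-job artifacts mentioned in this thread.
JOB_ARTIFACTS = {
    "intermediate_hive_table_data",
    "fact_distinct_columns",
    "rowkey_stats",
    "hfile",
    "cuboid",
}


def artifacts_to_delete(job_status):
    """Return the artifact labels that may be removed for a job."""
    if job_status == "finished":
        # Cuboid files are kept because a later merge reads them.
        return JOB_ARTIFACTS - {"cuboid"}
    if job_status == "discarded":
        # Proposed behaviour: nothing depends on a discarded job,
        # so all of its artifacts could be removed.
        return set(JOB_ARTIFACTS)
    # Any other status (e.g. still running): delete nothing.
    return set()


def cuboids_to_delete_after_merge(source_segments, merge_finished):
    """Cuboid files of the merged source segments go away once the merge is done."""
    return list(source_segments) if merge_finished else []
```

With these rules, the only long-lived cost besides the data in HBase is one set of cuboid files per live segment, which is the space overhead discussed in this mail.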
So, as far as I can see, Kylin actually takes more space for a cube: not only the HFiles but also the cuboid files. The former are used for queries and the latter for merges, and the cuboid files take more space than the HFiles. I think we could do something to optimize this. I would like to know your opinion about it.

2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:

> Hi, yu feng,
> I’ve also noticed these files and opened a jira:
> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
> tonight.
>
> Here are my opinions on your three questions, feel free to correct me:
>
> First, the data path of the intermediate hive table should be deleted
> after building, I agree with that.
>
> Second, the cuboid files will be used for merge and will be deleted when
> the merging job completes, so we must leave them on hdfs. The
> fact_distinct_columns should be deleted. In addition, the paths of
> rowkey_stats and hfile should also be deleted.
>
> Third, there are no garbage collection steps if a job is discarded, maybe
> we need a patch for this.
>
> Short answer:
> KYLIN-978 will clean all hdfs paths except cuboid files after buildJob
> and mergeJob complete.
> The hdfs paths will not be cleaned up if a job is discarded, we need
> improvement on this.
>
> Best Regards,
> Yerui Sun
> [email protected]
>
> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >
> > I see this core Improvement in release 1.0, JIRA url:
> > https://issues.apache.org/jira/browse/KYLIN-926
> >
> > However, after testing and checking the source code, I find some
> > rubbish (I am not sure) files in HDFS.
> >
> > First, kylin only drops the intermediate table in hive, but the table
> > is an EXTERNAL table, so the files still exist in the kylin tmp
> > directory in HDFS (I checked that).
> >
> > Second, the cuboid files take a lot of space in HDFS, and kylin does
> > not delete them after the cube build (fact_distinct_columns files
> > exist too). I am not sure whether they have other uses, please remind
> > me if they do.
> >
> > Third, after I discard a job, I think kylin should delete the
> > intermediate files and drop the intermediate hive table, even if it
> > deletes them asynchronously. I think those data are no longer used,
> > please remind me if they are.
> >
> > This rubbish data still exists in the current version (kylin-1.0),
> > please check, thanks.
