I think Kylin could finish merging by depending only on the tables in HBase. That would make merging cubes faster, wouldn't it?
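
To make the cleanup rules quoted below concrete, here is a minimal Python sketch of the policy as I understand it from this thread. All names are hypothetical labels for the intermediate outputs we discussed; this is not Kylin's code or its real directory layout, and the `clean_on_discard` flag models the improvement we are asking for, not current behavior.

```python
# Hypothetical labels for the intermediate outputs mentioned in this thread.
INTERMEDIATE_OUTPUTS = frozenset({
    "hive_intermediate_table",
    "fact_distinct_columns",
    "rowkey_stats",
    "hfile",
    "cuboid",
})

# Cuboid files are the input of a later merge, so they must survive a finished job.
KEEP_AFTER_FINISH = frozenset({"cuboid"})


def paths_to_delete(job_state, clean_on_discard=False):
    """Return the intermediate outputs to remove for a given job state.

    job_state is "finished" (a build or merge job completed) or "discarded".
    """
    if job_state == "finished":
        # Policy described for KYLIN-978: clean everything except cuboid files.
        return set(INTERMEDIATE_OUTPUTS - KEEP_AFTER_FINISH)
    if job_state == "discarded":
        # Today nothing is cleaned on discard; the flag models the proposed fix.
        return set(INTERMEDIATE_OUTPUTS) if clean_on_discard else set()
    raise ValueError("unknown job state: %r" % (job_state,))


if __name__ == "__main__":
    print(sorted(paths_to_delete("finished")))
    print(sorted(paths_to_delete("discarded")))
    print(sorted(paths_to_delete("discarded", clean_on_discard=True)))
```

If merging could read from HBase directly, "cuboid" could move out of `KEEP_AFTER_FINISH`, which is the space saving I am asking about.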
2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:

> After checking the source code, I find you are right: cuboid files are used
> while merging segments. But that raises a new question: why not have Kylin
> merge segments based only on the HFiles? I cannot find how to use an HBase
> table as the input format of a MapReduce job, but Kylin uses
> HFileOutputFormat as the output format when converting cuboids to HFiles.
>
> From this, I find that Kylin actually takes more space for a cube: not
> only the HFiles but also the cuboid files. The former are used for queries
> and the latter for merges, and the cuboid files are larger than the
> HFiles.
>
> I think we could do something to optimize this. I would like to know your
> opinions about it.
>
> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
>
>> Hi, yu feng,
>> I've also noticed these files and opened a JIRA:
>> https://issues.apache.org/jira/browse/KYLIN-978, and I'll post a patch
>> tonight.
>>
>> Here are my opinions on your three questions; feel free to correct me:
>>
>> First, the data path of the intermediate Hive table should be deleted
>> after building. I agree with that.
>>
>> Second, the cuboid files are used for merging and will be deleted
>> when the merge job completes, so we must leave them on HDFS. The
>> fact_distinct_columns files should be deleted. In addition, the paths of
>> rowkey_stats and hfile should also be deleted.
>>
>> Third, there are no garbage collection steps if a job is discarded; maybe
>> we need a patch for this.
>>
>> Short answer:
>> KYLIN-978 will clean all HDFS paths except the cuboid files after buildJob
>> and mergeJob complete.
>> The HDFS paths will not be cleaned up if a job is discarded; we need
>> an improvement for this.
>>
>> Best Regards,
>> Yerui Sun
>> [email protected]
>>
>> > On 2015-09-10, at 18:20, yu feng <[email protected]> wrote:
>> >
>> > I see this core improvement in release 1.0, JIRA URL:
>> > https://issues.apache.org/jira/browse/KYLIN-926
>> >
>> > However, after testing and checking the source code, I find some
>> > rubbish (I am not sure) files in HDFS.
>> >
>> > First, Kylin only drops the intermediate table in Hive, but the table
>> > is an EXTERNAL table, so the files still exist in the Kylin tmp
>> > directory in HDFS (I checked that).
>> >
>> > Second, the cuboid files take a lot of space in HDFS, and Kylin does
>> > not delete them after the cube build (the fact_distinct_columns files
>> > exist too). I am not sure whether those have other uses; please tell
>> > me if they do.
>> >
>> > Third, after I discard a job, I think Kylin should delete the
>> > intermediate files and drop the intermediate Hive table, even if it
>> > deletes them asynchronously. I think that data has no further use;
>> > please tell me if it does.
>> >
>> > This rubbish data still exists in the current version (kylin-1.0);
>> > please check, thanks.
