You are right, Yu: those files are used as input during a merge, and they can
be cleaned up after the merge completes. So they are really just long-lived
temporary files.

Thanks.

Best Regards!
---------------------
Luke Han

On Thu, Sep 10, 2015 at 7:22 PM, yu feng <[email protected]> wrote:

> I think Kylin could finish merging using only the tables in HBase. This would
> make merging cubes quicker, wouldn't it?
>
> 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
>
> > After checking the source code, I find you are right: cuboid files are used
> > when merging segments. But that raises a new question: why doesn't Kylin
> > merge segments based on the HFiles alone? I cannot find how to use an HBase
> > table as the input format of a MapReduce job, yet Kylin uses
> > HFileOutputFormat as the output format when converting cuboids to HFiles.
> >
> > From this, I see that a cube actually takes more space in Kylin: not only
> > the HFiles but also the cuboid files. The former are used for queries and
> > the latter for merging, and the cuboid files are larger than the HFiles.
> >
> > I think we could do something to optimize this. I would like to know your
> > opinions about it.
> >
> > 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> >
> >> Hi, yu feng,
> >> I've also noticed these files and opened a JIRA:
> >> https://issues.apache.org/jira/browse/KYLIN-978, and I'll post a patch
> >> tonight.
> >>
> >> Here are my opinions on your three questions; feel free to correct me:
> >>
> >> First, the data path of the intermediate Hive table should be deleted
> >> after building. I agree with that.
> >>
> >> Second, the cuboid files are used for merging and are deleted when the
> >> merge job completes, so we must leave them on HDFS until then. The
> >> fact_distinct_columns files should be deleted. In addition, the
> >> rowkey_stats and hfile paths should also be deleted.
> >>
> >> Third, there is no garbage-collection step if a job is discarded; maybe
> >> we need a patch for this.
> >>
> >> Short answer:
> >> KYLIN-978 will clean up all HDFS paths except the cuboid files after a
> >> build job or merge job completes.
> >> The HDFS paths are not cleaned up if a job is discarded; we need an
> >> improvement for this.
> >>
> >> Best Regards,
> >> Yerui Sun
> >> [email protected]
> >>
> >> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >> >
> >> > I see this core improvement in release 1.0, JIRA URL:
> >> > https://issues.apache.org/jira/browse/KYLIN-926
> >> >
> >> > However, after testing and checking the source code, I found some files
> >> > in HDFS that look like garbage (I am not sure).
> >> >
> >> > First, Kylin only drops the intermediate table in Hive, but the table is
> >> > an EXTERNAL table, so its files still exist in the Kylin tmp directory
> >> > in HDFS (I checked that).
> >> >
> >> > Second, the cuboid files take up a lot of space in HDFS, and Kylin does
> >> > not delete them after the cube build (the fact_distinct_columns files
> >> > remain too). I am not sure whether they have other uses; please let me
> >> > know if they do.
> >> >
> >> > Third, after I discard a job, I think Kylin should delete the
> >> > intermediate files and drop the intermediate Hive table, even if it
> >> > deletes them asynchronously. I think that data has no further use;
> >> > please let me know if it does.
> >> >
> >> > This garbage data still exists in the current version (kylin-1.0);
> >> > please check. Thanks.
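
For readers skimming the thread, the retention rule discussed above can be summarised in a small sketch. This is only an illustration of the policy as described in these messages, not Kylin code: the directory names (cuboid, fact_distinct_columns, rowkey_stats, hfile) come from the thread, while the function name, the constants, and the `segment_merged` flag are hypothetical.

```python
# Hypothetical sketch of the cleanup policy described in the thread.
# Directory names come from the messages above; the function and
# constant names are illustrative and are NOT Kylin APIs.

# Per-job outputs that the thread says can be removed once the job completes.
DELETE_AFTER_JOB = {"fact_distinct_columns", "rowkey_stats", "hfile"}

# Cuboid files are the input of a later merge, so they must be kept
# until the segment has actually been merged.
KEEP_UNTIL_MERGED = {"cuboid"}


def paths_to_delete(job_dirs, segment_merged):
    """Return the entries of job_dirs that are safe to remove.

    job_dirs: directory names found under a job's working directory.
    segment_merged: True once the segment has been merged into a new one.
    Unknown names are never returned, so the sketch errs on the side of
    keeping data.
    """
    deletable = []
    for name in job_dirs:
        if name in DELETE_AFTER_JOB:
            deletable.append(name)
        elif name in KEEP_UNTIL_MERGED and segment_merged:
            deletable.append(name)
    return deletable
```

Under these assumptions, a freshly built segment keeps only its cuboid directory, and that directory becomes removable once the merge has finished. Discarded jobs are not covered here, which matches the gap noted in the thread.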
