We have implemented merging directly from HTable in Kylin 2.0, which hasn’t been released or announced yet.
On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:

>I think Kylin could finish merging by depending only on the tables in HBase.
>This would make merging cubes quicker, wouldn't it?
>
>2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
>
>> After checking the source code, I find you are right: cuboid files are
>> used while merging segments. But a new question comes up: why does Kylin
>> merge segments just based on hfile? I cannot find how to take an HBase
>> table as the input format of a MapReduce job, but Kylin takes
>> HFileOutputFormat as the output format while converting cuboids to hfile.
>>
>> From this, I find Kylin actually takes more space for a cube: not only
>> hfiles but also cuboid files. The former are used for queries and the
>> latter are used for merging, and the cuboid files are bigger than the
>> hfiles.
>>
>> I think we could do something to optimize it... I would like to know
>> your opinions about it.
>>
>> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
>>
>>> Hi, yu feng,
>>> I’ve also noticed these files and opened a JIRA:
>>> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
>>> tonight.
>>>
>>> Here are my opinions on your three questions; feel free to correct me:
>>>
>>> First, the data path of the intermediate Hive table should be deleted
>>> after building; I agree with that.
>>>
>>> Second, the cuboid files will be used for merging and will be deleted
>>> when the merge job completes, so we must leave them on HDFS. The
>>> fact_distinct_columns files should be deleted. In addition, the paths
>>> of rowkey_stats and hfile should also be deleted.
>>>
>>> Third, there are no garbage collection steps if a job is discarded;
>>> maybe we need a patch for this.
>>>
>>>
>>> Short answer:
>>> KYLIN-978 will clean all HDFS paths except cuboid files after buildJob
>>> and mergeJob complete.
>>> The HDFS paths will not be cleaned up if a job is discarded; we need
>>> an improvement for this.
>>>
>>>
>>> Best Regards,
>>> Yerui Sun
>>> [email protected]
>>>
>>> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
>>> >
>>> > I see this core improvement in release 1.0, JIRA URL:
>>> > https://issues.apache.org/jira/browse/KYLIN-926
>>> >
>>> > However, after testing and checking the source code, I find some
>>> > rubbish (I am not sure) files in HDFS.
>>> >
>>> > First, Kylin only drops the intermediate table in Hive, but the table
>>> > is an EXTERNAL table, so the files still exist in the Kylin tmp
>>> > directory in HDFS (I checked that).
>>> >
>>> > Second, the cuboid files take a lot of space in HDFS, and Kylin does
>>> > not delete them after the cube build (the fact_distinct_columns files
>>> > exist too). I am not sure whether they have other uses; please remind
>>> > me if they do.
>>> >
>>> > Third, after I discard a job, I think Kylin should delete the
>>> > intermediate files and drop the intermediate Hive table, even if it
>>> > deletes them asynchronously. I think that data has no further use;
>>> > please remind me if it does.
>>> >
>>> > This rubbish data still exists in the current version (kylin-1.0);
>>> > please check, thanks.
