What good news! I hope you can release that version as soon as possible. Today I built a cube whose cuboid files total 1.9 TB. If we merge cubes based on cuboid files, I think it will be very slow.

2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:

> We have implemented the merge from HTable directly in Kylin 2.0, which
> hasn’t been released/announced.
>
> On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
>
> >I think Kylin can finish merging depending only on the tables in HBase.
> >This will make merging cubes quicker, won't it?
> >
> >2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
> >
> >> After checking the source code, I find you are right: cuboid files
> >> will be used while merging segments. But a new question comes: why
> >> doesn't Kylin merge segments just based on hfile? I cannot find how
> >> to take an HBase table as the input format of a MapReduce job, but
> >> Kylin takes HFileOutputFormat as the output format while changing
> >> cuboid to hfile.
> >>
> >> From this, I find Kylin actually takes more space for a cube: not
> >> only hfiles but also cuboid files. The former are used for query and
> >> the latter are used for merge, and the cuboid files are bigger than
> >> the hfiles.
> >>
> >> I think we could do something to optimize it... I want to know your
> >> opinions about it.
> >>
> >> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> >>
> >>> Hi, yu feng,
> >>> I’ve also noticed these files and opened a jira:
> >>> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
> >>> tonight.
> >>>
> >>> Here are my opinions on your three questions, feel free to correct me:
> >>>
> >>> First, the data path of the intermediate hive table should be deleted
> >>> after building; I agree with that.
> >>>
> >>> Second, the cuboid files will be used for merge and will be deleted
> >>> when the merging job completes, so we must leave them on hdfs. The
> >>> fact_distinct_columns should be deleted. In addition, the paths of
> >>> rowkey_stats and hfile should also be deleted.
> >>>
> >>> Third, there are no garbage collection steps if a job is discarded;
> >>> maybe we need a patch for this.
> >>>
> >>> Short answer:
> >>> KYLIN-978 will clean all hdfs paths except cuboid files after buildJob
> >>> and mergeJob complete.
> >>> The hdfs paths will not be cleaned up if a job is discarded; we need
> >>> an improvement on this.
> >>>
> >>> Best Regards,
> >>> Yerui Sun
> >>> [email protected]
> >>>
> >>> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >>> >
> >>> > I see this core improvement in release 1.0, JIRA url:
> >>> > https://issues.apache.org/jira/browse/KYLIN-926
> >>> >
> >>> > However, after my test and checking the source code, I find some
> >>> > rubbish (I am not sure) files in HDFS.
> >>> >
> >>> > First, Kylin only drops the intermediate table in hive, but the
> >>> > table is an EXTERNAL table, so the files still exist in the Kylin
> >>> > tmp directory in HDFS (I checked that).
> >>> >
> >>> > Second, the cuboid files take a large space in HDFS, and Kylin does
> >>> > not delete them after the cube build (fact_distinct_columns files
> >>> > exist too). I am not sure whether those have other effects; remind
> >>> > me please if they do.
> >>> >
> >>> > Third, after I discard a job, I think Kylin should delete the
> >>> > intermediate files and drop the intermediate hive table, even if it
> >>> > deletes them asynchronously. I think those data do not have any
> >>> > effects; remind me please if they do.
> >>> >
> >>> > This rubbish data still exists in the current version (kylin-1.0),
> >>> > please check, thanks.
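
To make the cleanup policy from the quoted thread concrete (keep cuboid files because they are needed for merge, remove the other job outputs, and remove everything once a job is discarded), here is a minimal Python sketch. It only plans which paths to delete and touches no file system. The function name plan_cleanup and the example directory layout are hypothetical, not Kylin's actual code; only the names cuboid, fact_distinct_columns, rowkey_stats and hfile come from the thread, and deleting everything for a discarded job is the proposal discussed above, not current behavior.

```python
from pathlib import PurePosixPath

# Sub-directory that must survive a build because merge reads it (per the thread).
KEEP_FOR_MERGE = {"cuboid"}


def plan_cleanup(paths, job_discarded=False):
    """Split job output paths into (to_delete, to_keep).

    Hypothetical helper: it classifies paths by their components only and
    does not contact HDFS.
    """
    to_delete, to_keep = [], []
    for p in paths:
        parts = set(PurePosixPath(p).parts)
        # A discarded job will never be merged, so nothing needs to be kept.
        if not job_discarded and parts & KEEP_FOR_MERGE:
            to_keep.append(p)
        else:
            to_delete.append(p)
    return to_delete, to_keep


# Assumed layout of one job's working directory, for illustration only.
example_paths = [
    "/kylin/job-1/intermediate_table",
    "/kylin/job-1/cube_a/fact_distinct_columns",
    "/kylin/job-1/cube_a/rowkey_stats",
    "/kylin/job-1/cube_a/hfile",
    "/kylin/job-1/cube_a/cuboid/base",
]
```

Calling plan_cleanup(example_paths) keeps only the cuboid path, while plan_cleanup(example_paths, job_discarded=True) marks every path for deletion.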
