Re: rubbish files exist in HDFS

yu feng Thu, 10 Sep 2015 04:50:38 -0700

What good news! I hope you can release that version as soon as possible.
Today I built a cube whose cuboid files total 1.9 TB. If we merge cubes
based on the cuboid files, I think it will be very slow.


2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:

> We have implemented the merge from HTable directly in Kylin 2.0, which
> hasn’t been released/announced.
>
> On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
>
> >I think Kylin could finish merging by depending only on the tables in
> >HBase. This would make merging cubes quicker, wouldn't it?
> >
> >2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
> >
> >> After checking the source code, I find you are right: the cuboid files
> >> are used while merging segments. But a new question arises: why does
> >> Kylin [not] merge segments based just on the HFiles? I cannot find where
> >> an HBase table is taken as the input format of a MapReduce job, although
> >> Kylin uses HFileOutputFormat as the output format when converting
> >> cuboids to HFiles.
> >>
> >> From this, I find that Kylin actually takes more space for a cube: not
> >> only the HFiles but also the cuboid files. The former are used for
> >> queries and the latter for merging, and the cuboid files are larger
> >> than the HFiles.
> >>
> >> I think we could do something to optimize this. I would like to know
> >> your opinions about it.
> >>
> >> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> >>
> >>> Hi, yu feng,
> >>>   I’ve also noticed these files and opened a jira:
> >>> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
> >>> tonight.
> >>>
> >>>   Here are my opinions on your three questions; feel free to correct me:
> >>>
> >>>   First, the data path of the intermediate Hive table should be deleted
> >>> after building; I agree with that.
> >>>
> >>>   Second, the cuboid files will be used for merging and will be deleted
> >>> when the merge job completes, so we must leave them on HDFS. The
> >>> fact_distinct_columns should be deleted. In addition, the paths of
> >>> rowkey_stats and hfile should also be deleted.
> >>>
> >>>   Third, there are no garbage-collection steps if a job is discarded;
> >>> maybe we need a patch for this.
> >>>
> >>>
> >>> Short answer:
> >>>   KYLIN-978 will clean all HDFS paths except the cuboid files after a
> >>> buildJob or mergeJob completes.
> >>>   The HDFS paths will not be cleaned up if a job is discarded; we need
> >>> to improve this.
> >>>
> >>>
> >>> Best Regards,
> >>> Yerui Sun
> >>> [email protected]
> >>>
> >>>
> >>>
> >>> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >>> >
> >>> > I saw this core improvement in release 1.0; JIRA URL:
> >>> > https://issues.apache.org/jira/browse/KYLIN-926
> >>> >
> >>> > However, after testing and checking the source code, I found some
> >>> > files in HDFS that look like rubbish (I am not sure).
> >>> >
> >>> > First, Kylin only drops the intermediate table in Hive, but the table
> >>> > is an EXTERNAL table, so its files still exist in the Kylin tmp
> >>> > directory in HDFS (I checked that).
> >>> >
> >>> > Second, the cuboid files take a lot of space in HDFS, and Kylin does
> >>> > not delete them after the cube build (the fact_distinct_columns files
> >>> > remain too). I am not sure whether they serve some other purpose;
> >>> > please let me know if they do.
> >>> >
> >>> > Third, after I discard a job, I think Kylin should delete the
> >>> > intermediate files and drop the intermediate Hive table, even if it
> >>> > deletes them asynchronously. I do not think that data serves any
> >>> > purpose; please let me know if it does.
> >>> >
> >>> > This rubbish data still exists in the current version (kylin-1.0);
> >>> > please check. Thanks.
> >>>
> >>>
> >>
>
>
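
The cleanup rules discussed in this thread can be summarized in a small
sketch. It is only an illustration in Python, not Kylin's code: the
sub-directory names, the event names and the paths_to_delete helper are all
assumptions made for the example, and the real Kylin working-directory layout
may differ. The only real command it relies on is `hadoop fs -rm -r`.

```python
from posixpath import join

# Hypothetical sub-directory names under one job's HDFS working directory.
# These are assumptions for illustration; Kylin's actual layout may differ.
INTERMEDIATE = ["intermediate_table", "fact_distinct_columns",
                "rowkey_stats", "hfile"]
CUBOID = "cuboid"


def paths_to_delete(job_dir, event):
    """Return the HDFS paths that could be removed after `event`.

    event is one of (names are hypothetical):
      "build_done" - build finished; cuboid files are kept for a later merge
      "merged"     - the segment was merged away; its cuboid files can go too
      "discarded"  - the job was discarded; nothing it wrote is needed
    """
    if event == "build_done":
        names = list(INTERMEDIATE)
    elif event in ("merged", "discarded"):
        names = INTERMEDIATE + [CUBOID]
    else:
        raise ValueError("unknown event: %r" % (event,))
    return [join(job_dir, name) for name in names]


def rm_command(path):
    """Build the argument list for a recursive HDFS delete of `path`."""
    return ["hadoop", "fs", "-rm", "-r", path]


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is deleted.
    for p in paths_to_delete("/kylin/job-0001", "build_done"):
        print(" ".join(rm_command(p)))
```

The sketch prints the delete commands rather than executing them, so the
list can be reviewed before anything is removed from HDFS.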
