Re: rubbish files exist in HDFS

Luke Han Thu, 10 Sep 2015 04:33:02 -0700

You are right, Yu: those files are used as input during the merge.

They can be cleaned up after the merge.


So they are really just long-lived temporary files.

Thanks.


Best Regards!
---------------------

Luke Han

On Thu, Sep 10, 2015 at 7:22 PM, yu feng <[email protected]> wrote:

> I think Kylin could complete the merge using only the tables in HBase.
> That would make merging cubes faster, wouldn't it?
>
> 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
>
> > After checking the source code, I find you are right: cuboid files are
> > used while merging segments. But this raises a new question: why can't
> > Kylin merge segments based on the HFiles alone? I cannot find how to use
> > an HBase table as the input format of a MapReduce job, although Kylin
> > uses HFileOutputFormat as the output format when converting cuboids to
> > HFiles.
> >
> > From this, I see that a cube in Kylin actually takes more space than
> > expected: not only the HFiles but also the cuboid files. The former are
> > used for queries and the latter for merges, and the cuboid files are
> > larger than the HFiles.
> >
> > I think we could do something to optimize this. I would like to hear
> > your opinions about it.
> >
> > 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> >
> >> Hi, yu feng,
> >>   I’ve also noticed these files and opened a jira:
> >> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
> >> tonight.
> >>
> >>   Here are my opinions on your three questions; feel free to correct me:
> >>
> >>   First, the data path of the intermediate Hive table should be deleted
> >> after building; I agree with that.
> >>
> >>   Second, the cuboid files are used for merging and will be deleted
> >> when the merge job completes, so we must leave them on HDFS. The
> >> fact_distinct_columns files should be deleted. In addition, the
> >> rowkey_stats and hfile paths should also be deleted.
> >>
> >>   Third, there are no garbage collection steps when a job is discarded;
> >> maybe we need a patch for this.
> >>
> >>
> >> Short answer:
> >>   KYLIN-978 will clean up all HDFS paths except the cuboid files after
> >> a buildJob or mergeJob completes.
> >>   The HDFS paths will not be cleaned up if a job is discarded; we need
> >> to improve this.
> >>
> >>
> >> Best Regards,
> >> Yerui Sun
> >> [email protected]
> >>
> >>
> >>
> >> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >> >
> >> > I saw this core improvement in release 1.0. JIRA URL:
> >> > https://issues.apache.org/jira/browse/KYLIN-926
> >> >
> >> > However, after testing and checking the source code, I found some
> >> > files in HDFS that look like rubbish (I am not sure).
> >> >
> >> > First, Kylin only drops the intermediate table in Hive, but the table
> >> > is an EXTERNAL table, so its files still exist in the Kylin tmp
> >> > directory in HDFS (I checked that).
> >> >
> >> > Second, the cuboid files take up a lot of space in HDFS, and Kylin
> >> > does not delete them after the cube build (the fact_distinct_columns
> >> > files remain too). I am not sure whether these files serve some other
> >> > purpose; please let me know if they do.
> >> >
> >> > Third, after I discard a job, I think Kylin should delete the
> >> > intermediate files and drop the intermediate Hive table, even if it
> >> > deletes them asynchronously. I do not think that data serves any
> >> > purpose; please let me know if it does.
> >> >
> >> > This rubbish data still exists in the current version (kylin-1.0);
> >> > please check. Thanks.
> >>
> >>
> >
>
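
For reference, the cleanup rules discussed in this thread could be sketched roughly as below. This is only an illustration of the policy, not Kylin's code: the function name, the event names and the file-kind labels are all hypothetical, and deleting everything for a discarded job is the proposal made above, not existing behavior.

```python
# Hypothetical sketch of the cleanup policy described in this thread.
# Labels follow the names used in the emails, not Kylin's real HDFS layout or API.
INTERMEDIATE = {
    "hive_intermediate_table",  # data path of the EXTERNAL intermediate Hive table
    "fact_distinct_columns",
    "rowkey_stats",
    "hfile",
}
CUBOID = "cuboid"


def paths_to_delete(event, present):
    """Return the kinds of output in `present` that may be removed after `event`."""
    if event in ("build_done", "merge_done"):
        # The new segment's cuboid files stay: a later merge reads them as input.
        doomed = INTERMEDIATE
    elif event in ("segment_merged_away", "job_discarded"):
        # Assumption: nothing reads these outputs again, so cuboid files can go too.
        doomed = INTERMEDIATE | {CUBOID}
    else:
        raise ValueError("unknown event: %r" % (event,))
    return sorted(kind for kind in present if kind in doomed)
```

For example, paths_to_delete("build_done", ["cuboid", "hfile"]) keeps the cuboid files and returns only ["hfile"], which matches the "long-lived temporary files" point above.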
