OK, I have found another problem (I am a problem maker, ^_^). Today I built a cube with 15 dimensions (one mandatory dimension, two hierarchy dimensions, and the rest normal dimensions), and the cuboid files are 1.9TB. The step of converting cuboids to HFiles is too slow; checking the log of this job, I found 9000+ mappers but only one reducer.
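My guess (I have not verified this against Kylin's source, so treat it as an assumption) is that the single reducer comes from HBase's bulk-load helper: HFileOutputFormat.configureIncrementalLoad() sets the number of reduce tasks to the number of regions in the target table, so a table created with a single region gets a single reducer. A minimal sketch of a classic bulk-load job (my own illustration, not Kylin's code; the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "cuboid-to-hfile-sketch");
        // ... mapper class, input format and key/value types omitted ...

        HTable table = new HTable(conf, "MY_CUBE_HTABLE"); // made-up table name
        // configureIncrementalLoad wires in TotalOrderPartitioner and, crucially,
        // calls job.setNumReduceTasks(<number of regions in the table>).
        // A target table with only one region therefore gives exactly one reducer,
        // no matter how many mappers the input produces.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfile-output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If that is really the cause, pre-splitting the HTable into more regions (or fixing however the region split is calculated) should raise the reducer count.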
I discarded this job when our Hadoop administrator told me that the node which ran this reducer was out of disk space, so I had to stop it. I wonder why there is only one reducer (I have not checked the source code of this job). By the way, my original data is only hundreds of MB; I think this would cause even more problems if the original data were bigger or there were many more dimensions.

2015-09-10 23:46 GMT+08:00 Luke Han <[email protected]>:

> 2.0 will not come out soon; there is a huge refactoring and a bunch of new
> features, and we have to make sure there are no critical bugs before the release.
>
> The same function is also available under the v1.x branch; please stay tuned for
> updated information about that.
>
> Thanks.
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Thu, Sep 10, 2015 at 7:50 PM, yu feng <[email protected]> wrote:
>
> > What good news! I hope you can release that version as quickly as
> > possible. Today I built a cube whose cuboid files are 1.9TB; if we merge
> > cubes based on cuboid files, I think it will be very slow.
> >
> > 2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:
> >
> > > We have implemented merging directly from the HTable in Kylin 2.0, which
> > > hasn't been released/announced yet.
> > >
> > > On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
> > >
> > > > I think Kylin could finish merging by depending only on the tables in
> > > > HBase. That would make merging cubes much quicker, wouldn't it?
> > > >
> > > > 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
> > > >
> > > >> After checking the source code, I find you are right: the cuboid files are
> > > >> used while merging segments. But a new question comes up: why doesn't Kylin
> > > >> merge segments based on the HFiles? I cannot find any place where an HBase
> > > >> table is taken as the input format of a MapReduce job, although Kylin does
> > > >> take HFileOutputFormat as the output format while converting cuboids to HFiles.
> > > >>
> > > >> From this I find that Kylin actually takes more space for a cube: not
> > > >> only the HFiles but also the cuboid files. The former are used for queries
> > > >> and the latter for merging, and the cuboid files take more space than
> > > >> the HFiles.
> > > >>
> > > >> I think we could do something to optimize this... I would like to know your
> > > >> opinions about it.
> > > >>
> > > >> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> > > >>
> > > >>> Hi, yu feng,
> > > >>> I've also noticed these files and opened a JIRA:
> > > >>> https://issues.apache.org/jira/browse/KYLIN-978; I'll post a patch
> > > >>> tonight.
> > > >>>
> > > >>> Here are my opinions on your three questions; feel free to correct me:
> > > >>>
> > > >>> First, the data path of the intermediate Hive table should be deleted
> > > >>> after building; I agree with that.
> > > >>>
> > > >>> Second, the cuboid files will be used for merging and will be deleted
> > > >>> when the merge job completes; we need them and must leave them on HDFS.
> > > >>> The fact_distinct_columns path should be deleted. Additionally, the
> > > >>> rowkey_stats and hfile paths
> > > >>> should also be deleted.
> > > >>>
> > > >>> Third, there is no garbage collection step if a job is discarded; maybe we
> > > >>> need a patch for this.
> > > >>>
> > > >>>
> > > >>> Short answer:
> > > >>> KYLIN-978 will clean all HDFS paths except the cuboid files after the
> > > >>> build job and merge job complete.
> > > >>> The HDFS paths will not be cleaned up if a job is discarded; we need an
> > > >>> improvement for this.
> > > >>>
> > > >>>
> > > >>> Best Regards,
> > > >>> Yerui Sun
> > > >>> [email protected]
> > > >>>
> > > >>>
> > > >>>
> > > >>> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> > > >>> >
> > > >>> > I see this core improvement in release 1.0; JIRA URL:
> > > >>> > https://issues.apache.org/jira/browse/KYLIN-926
> > > >>> >
> > > >>> > However, after my tests and after checking the source code, I found some
> > > >>> > rubbish (I am not sure) files in HDFS.
> > > >>> >
> > > >>> > First, Kylin only drops the intermediate table in Hive, but the table
> > > >>> > is an EXTERNAL table, so the files still exist in the Kylin tmp
> > > >>> > directory in HDFS (I checked that).
> > > >>> >
> > > >>> > Second, the cuboid files take a large amount of space in HDFS, and Kylin
> > > >>> > does not delete them after the cube build (fact_distinct_columns files
> > > >>> > exist too). I am not sure whether they have other effects; please
> > > >>> > remind me if they do.
> > > >>> >
> > > >>> > Third, after I discard a job, I think Kylin should delete the
> > > >>> > intermediate files and drop the intermediate Hive table, even if it
> > > >>> > deletes them asynchronously. I think this data does not have any
> > > >>> > effect; please remind me if it does.
> > > >>> >
> > > >>> > This rubbish data still exists in the current version (kylin-1.0);
> > > >>> > please check, thanks.
> > > >>>
> > > >>
> > >
> >
>
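P.S. Regarding the question above about using an HBase table as the input of a MapReduce job: HBase itself does ship TableInputFormat, which TableMapReduceUtil can wire into a job, so reading segments straight from the HTable is at least technically possible. A minimal read-only sketch (my own illustration, not Kylin code; the table name is made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadHTableSketch {

    // Identity-style mapper: a real merge job would decode and re-encode
    // cuboid rows here instead of just re-emitting them.
    static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            context.write(rowKey, row);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "read-htable-sketch");

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner caching for batch throughput
        scan.setCacheBlocks(false);  // don't pollute the block cache with a full scan

        // "MY_SEGMENT_HTABLE" is a made-up table name.
        TableMapReduceUtil.initTableMapperJob(
                "MY_SEGMENT_HTABLE", scan, RowMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

        job.setNumReduceTasks(0);                          // map-only for this sketch
        job.setOutputFormatClass(NullOutputFormat.class);  // discard output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Whether this would actually be faster than reading the cuboid files is a separate question: a full scan through HBase is usually slower per byte than a sequential HDFS read, which may be part of why merging is done from the cuboid files today.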

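P.S. 2: For the garbage collection discussion, the cleanup step itself would just be recursive HDFS deletes of the per-job paths Yerui listed. A minimal sketch, assuming a made-up per-job working directory layout (real Kylin paths differ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical working directory for one build job; placeholder path.
        String jobWorkingDir = "/kylin/kylin_metadata/kylin-<job-id>";
        // The subdirectories named in this thread as safe to remove after a
        // build completes (cuboid files are kept for later merges).
        for (String sub : new String[] {"fact_distinct_columns", "rowkey_stats", "hfile"}) {
            Path p = new Path(jobWorkingDir, sub);
            if (fs.exists(p)) {
                boolean deleted = fs.delete(p, true); // recursive delete
                System.out.println((deleted ? "deleted " : "failed to delete ") + p);
            }
        }
    }
}

Running the same kind of sweep when a job is discarded (plus dropping the intermediate Hive table and its external data path) would cover the third point as well.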