Hello,
We observed that purging and dropping a cube is not deleting
dictionaries / snapshots and also not dropping the table in HBase.
Also, it's leaving a lot of temporary data in HDFS.
We are on 0.7.2. I hope it will be fixed shortly and on priority.
I saw that the ticket has been fixed in v1.1 and v2. Can this be
back-ported to 0.7.2?
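Until such a back-port lands, leftovers can be found by diffing what Kylin's metadata still references against what actually exists in HBase/HDFS. A minimal sketch of that diff (Python; the table names below are made up for illustration, not real Kylin identifiers):

```python
def find_orphans(referenced, existing):
    """Resources present in storage but no longer referenced by any
    cube segment -- candidates for manual cleanup after a purge/drop."""
    return sorted(set(existing) - set(referenced))

# Hypothetical listings: metadata references vs. actual HBase tables.
referenced = ["KYLIN_CUBE_A_SEG1"]
existing = ["KYLIN_CUBE_A_SEG1", "KYLIN_CUBE_B_SEG1", "KYLIN_CUBE_B_SEG2"]
print(find_orphans(referenced, existing))
# -> ['KYLIN_CUBE_B_SEG1', 'KYLIN_CUBE_B_SEG2']
```

The same diff applies equally to dictionary/snapshot paths and HDFS temp directories.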
Regards,
Abhilash
On Fri, Sep 18, 2015 at 11:02 AM, yu feng <[email protected]> wrote:
> After building another cube successfully, I rechecked this bug and found
> the reason, thanks to all of you ...
>
> 2015-09-11 11:17 GMT+08:00 ShaoFeng Shi <[email protected]>:
>
> > If "rowkey_stats" wasn't found, Kylin should throw an exception and
> > exit, instead of silently using 1 region; I'm going to change this,
> > please let me know if you don't agree.
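The fail-fast behavior proposed here could look like this sketch (Python; function and parameter names are hypothetical, not Kylin's actual code):

```python
def decide_region_count(rowkey_stats_found, split_keys):
    """Fail fast when rowkey stats are missing, instead of silently
    falling back to a single-region HTable."""
    if not rowkey_stats_found:
        raise RuntimeError("rowkey_stats not found; refusing to create "
                           "a single-region HTable silently")
    # n split keys bound n + 1 regions.
    return len(split_keys) + 1

print(decide_region_count(True, ["key1", "key2"]))  # -> 3
```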
> >
> > 2015-09-11 10:17 GMT+08:00 Yerui Sun <[email protected]>:
> >
> > > Hi, yu feng,
> > > Let me guess the reason of your problem.
> > >
> > > The number of reducers in the convert-to-HFile job depends on the
> > > number of regions of the corresponding HTable.
> > >
> > > For now, all HTables were created with only one region, caused by
> > > the wrong path of rowkey_stats. I've opened a JIRA for this issue:
> > > https://issues.apache.org/jira/browse/KYLIN-968. The patch has been
> > > available since last night.
> > >
> > > Here are some clues to confirm my guess:
> > > 1. You can find the corresponding HTable name in the log; check its
> > > regions, it should have only one region.
> > > 2. Check your Kylin working directory on HDFS; there should be a
> > > path like '../kylin-null/../rowkey_stats'.
> > > 3. Grep your kylin.log in the tomcat dir; you should find a log line
> > > containing 'no region split, HTable will be one region'.
> > >
> > > If you hit all three clues, I think KYLIN-968 could resolve your
> > > problem.
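The three clues can also be checked mechanically; a small sketch (Python; the inputs are placeholders for what you would read from the HBase shell, the HDFS listing, and kylin.log):

```python
def kylin_968_applies(region_count, has_kylin_null_rowkey_stats, log_text):
    """True only when all three diagnostic clues hold at once:
    single-region HTable, rowkey_stats under a kylin-null path,
    and the tell-tale line in kylin.log."""
    return (region_count == 1
            and has_kylin_null_rowkey_stats
            and "no region split, HTable will be one region" in log_text)

log = "... no region split, HTable will be one region ..."
print(kylin_968_applies(1, True, log))  # -> True
```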
> > >
> > >
> > > > On Sep 11, 2015, at 00:54, yu feng <[email protected]> wrote:
> > > >
> > > > OK, I found another problem (I am a problem maker, ^_^). Today I
> > > > built this cube, which has 15 dimensions (one mandatory dimension,
> > > > two hierarchy dimensions, and the others normal dimensions). I find
> > > > the cuboid files are 1.9TB, and the step of converting cuboids to
> > > > HFiles is too slow. I checked the log of this job and found there
> > > > are 9000+ mappers and only one reducer.
> > > >
> > > > I discarded this job when our Hadoop administrator told me that the
> > > > node running this reducer was out of disk space. I had to stop it.
> > > > I wonder why there is only one reducer (I did not check the source
> > > > code of this job). By the way, my original data is only hundreds of
> > > > MB. I think this would cause more problems if the original data were
> > > > bigger or there were many more dimensions.
> > > >
> > > > 2015-09-10 23:46 GMT+08:00 Luke Han <[email protected]>:
> > > >
> > > >> 2.0 will not come soon; there is huge refactoring and a bunch of
> > > >> new features, and we have to make sure there are no critical bugs
> > > >> before release.
> > > >>
> > > >> The same function is also available in the v1.x branch; please
> > > >> stay tuned for updates on that.
> > > >>
> > > >> Thanks.
> > > >>
> > > >>
> > > >> Best Regards!
> > > >> ---------------------
> > > >>
> > > >> Luke Han
> > > >>
> > > >> On Thu, Sep 10, 2015 at 7:50 PM, yu feng <[email protected]> wrote:
> > > >>
> > > >>> What good news! I wish you could release the version as quickly
> > > >>> as possible. Today I built a cube whose cuboid files total 1.9TB.
> > > >>> If we merge cubes based on cuboid files, I think it will be very
> > > >>> slow.
> > > >>>
> > > >>> 2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:
> > > >>>
> > > >>>> We have implemented merging directly from the HTable in Kylin
> > > >>>> 2.0, which hasn't been released/announced yet.
> > > >>>>
> > > >>>> On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
> > > >>>>
> > > >>>>> I think Kylin can finish merging based only on the tables in
> > > >>>>> HBase; this would make merging cubes much quicker, wouldn't it?
> > > >>>>>
> > > >>>>> 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
> > > >>>>>
> > > >>>>>> After checking the source code, I find you are right: cuboid
> > > >>>>>> files will be used while merging segments. But a new question
> > > >>>>>> comes: why doesn't Kylin merge segments based just on HFiles?
> > > >>>>>> I cannot find how to take an HBase table as the input format
> > > >>>>>> of a MapReduce job, though Kylin does take HFileOutputFormat
> > > >>>>>> as the output format when converting cuboids to HFiles.
> > > >>>>>>
> > > >>>>>> From this, I find Kylin actually takes more space for a cube:
> > > >>>>>> not only the HFiles but also the cuboid files. The former are
> > > >>>>>> used for queries and the latter for merging, and the cuboid
> > > >>>>>> files are bigger than the HFiles.
> > > >>>>>>
> > > >>>>>> I think we could do something to optimize this... I want to
> > > >>>>>> know your opinions about it.
> > > >>>>>>
> > > >>>>>> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> > > >>>>>>
> > > >>>>>>> Hi, yu feng,
> > > >>>>>>> I've also noticed these files and opened a JIRA:
> > > >>>>>>> https://issues.apache.org/jira/browse/KYLIN-978, and I'll
> > > >>>>>>> post a patch tonight.
> > > >>>>>>>
> > > >>>>>>> Here are my opinions on your three questions; feel free to
> > > >>>>>>> correct me:
> > > >>>>>>>
> > > >>>>>>> First, the data path of the intermediate Hive table should
> > > >>>>>>> be deleted after building; I agree with that.
> > > >>>>>>>
> > > >>>>>>> Second, the cuboid files will be used for merging and will
> > > >>>>>>> be deleted when the merge job completes; we need to, and
> > > >>>>>>> must, leave them on HDFS. The fact_distinct_columns path
> > > >>>>>>> should be deleted. Additionally, the rowkey_stats and hfile
> > > >>>>>>> paths should also be deleted.
> > > >>>>>>>
> > > >>>>>>> Third, there are no garbage collection steps if a job is
> > > >>>>>>> discarded; maybe we need a patch for this.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Short answer:
> > > >>>>>>> KYLIN-978 will clean all HDFS paths except the cuboid files
> > > >>>>>>> after the build job and merge job complete.
> > > >>>>>>> The HDFS paths will not be cleaned up if a job is discarded;
> > > >>>>>>> we need improvement on this.
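The cleanup rule described in this thread boils down to: after a completed build or merge, delete every per-job HDFS path except the cuboid files (a later merge still needs them); a discarded job currently cleans nothing. A sketch of that rule (Python; the subdirectory names follow paths mentioned in this thread, but the directory layout itself is an assumption):

```python
def paths_to_clean(job_dir, job_completed):
    """Paths a KYLIN-978-style cleanup would remove after a successful
    build/merge job. Cuboid files are kept because merging reads them
    later; a discarded job gets no cleanup at all (the known gap)."""
    if not job_completed:
        return []  # current behavior: discarded jobs leave everything
    keep = {"cuboid"}
    subdirs = ["fact_distinct_columns", "rowkey_stats", "hfile", "cuboid"]
    return [f"{job_dir}/{d}" for d in subdirs if d not in keep]

print(paths_to_clean("/kylin/kylin-job-1", True))
```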
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Best Regards,
> > > >>>>>>> Yerui Sun
> > > >>>>>>> [email protected]
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>> On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> > > >>>>>>>>
> > > >>>>>>>> I saw this core improvement in release 1.0; JIRA URL:
> > > >>>>>>>> https://issues.apache.org/jira/browse/KYLIN-926
> > > >>>>>>>>
> > > >>>>>>>> However, after my test and a check of the source code, I
> > > >>>>>>>> found some garbage files (I am not sure) in HDFS.
> > > >>>>>>>>
> > > >>>>>>>> First, Kylin only drops the intermediate table in Hive, but
> > > >>>>>>>> the table is an EXTERNAL table, so the files still exist in
> > > >>>>>>>> Kylin's tmp directory in HDFS (I checked that).
> > > >>>>>>>>
> > > >>>>>>>> Second, the cuboid files take a large amount of space in
> > > >>>>>>>> HDFS, and Kylin does not delete them after the cube build
> > > >>>>>>>> (fact_distinct_columns files exist too). I am not sure
> > > >>>>>>>> whether these have other effects; please remind me if they
> > > >>>>>>>> do.
> > > >>>>>>>>
> > > >>>>>>>> Third, after I discard a job, I think Kylin should delete
> > > >>>>>>>> the intermediate files and drop the intermediate Hive table,
> > > >>>>>>>> even if it deletes them asynchronously. I think this data
> > > >>>>>>>> does not have any effect... please remind me if it does.
> > > >>>>>>>>
> > > >>>>>>>> This garbage data still exists in the current
> > > >>>>>>>> version (kylin-1.0); please check, thanks.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
>