For v1.0 or earlier, please refer to this doc to do the cleanup manually: https://kylin.incubator.apache.org/docs/howto/howto_cleanup_storage.html
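For reference, here is a rough Java sketch of the kind of manual cleanup the linked doc describes: drop KYLIN_* HTables that no cube segment references any more, and delete stale per-job working directories on HDFS. This is not Kylin's StorageCleanupJob; the table-name prefix, the metadata inputs and the working-directory layout are assumptions for illustration, using the HBase 0.98-era client API.

    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ManualStorageCleanup {

        // Hypothetical inputs: HTable names still referenced by cube segments, and
        // job IDs still running, both collected beforehand from the Kylin metadata.
        public static void cleanup(Set<String> tablesInUse, Set<String> activeJobIds,
                                   String workingDir) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // 1. Drop KYLIN_* HTables that are no longer referenced by any segment.
            HBaseAdmin admin = new HBaseAdmin(conf);
            try {
                for (HTableDescriptor desc : admin.listTables("KYLIN_.*")) {
                    String name = desc.getNameAsString();
                    if (!tablesInUse.contains(name)) {
                        admin.disableTable(name);
                        admin.deleteTable(name);
                        System.out.println("Dropped unused HTable " + name);
                    }
                }
            } finally {
                admin.close();
            }

            // 2. Delete job working directories (assumed to be named "kylin-<jobId>")
            //    for jobs that are no longer running.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path(workingDir))) {
                String jobId = status.getPath().getName().replaceFirst("^kylin-", "");
                if (!activeJobIds.contains(jobId)) {
                    fs.delete(status.getPath(), true);
                    System.out.println("Deleted stale working dir " + status.getPath());
                }
            }
        }
    }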
On 9/30/15, 9:00 AM, "Luke Han" <[email protected]> wrote:

Hi Abhilash,
I would recommend upgrading to v1.0 or v1.1 (the latter is in the release process now).

Thanks.
Luke


On Wed, Sep 30, 2015 at 12:46 AM, Abhilash L L <[email protected]> wrote:

Hello,

We observed that purging and dropping a cube does not delete dictionaries / snapshots and also does not drop the table in HBase. It also leaves a lot of temporary data in HDFS.

We are on 0.7.2. I hope this is being fixed shortly and with priority. I saw that the ticket has been fixed in v1.1 and v2. Can this be back-ported to 0.7.2?

Regards,
Abhilash


On Fri, Sep 18, 2015 at 11:02 AM, yu feng <[email protected]> wrote:

After building another cube successfully, I rechecked this bug and found the reason. Thanks to all of you.


2015-09-11 11:17 GMT+08:00 ShaoFeng Shi <[email protected]>:

If "rowkey_stats" wasn't found, Kylin should throw an exception and exit instead of silently using 1 region; I'm going to change this, please let me know if you don't agree.


2015-09-11 10:17 GMT+08:00 Yerui Sun <[email protected]>:

Hi, yu feng,
Let me guess the reason for your problem.

The number of reducers of the convert-to-HFile job depends on the number of regions of the corresponding HTable.

For now, all HTables were created with only one region, caused by the wrong path of rowkey_stats. I've opened a JIRA for this issue: https://issues.apache.org/jira/browse/KYLIN-968. The patch has been available since last night.

Here are some clues to confirm my guess:
1. You can find the corresponding HTable name in the log; check its regions, it should have only one region.
2. Check your Kylin working directory on HDFS; there should be a path like '../kylin-null/../rowkey_stats'.
3. Grep your kylin.log in the Tomcat dir; you should find a log line containing 'no region split, HTable will be one region'.

If you hit all three clues, I think KYLIN-968 will resolve your problem.


On Sep 11, 2015, at 00:54, yu feng <[email protected]> wrote:

OK, I found another problem (I am a problem maker, ^_^). Today I built a cube which has 15 dimensions (one mandatory dimension, two hierarchy dimensions, and the rest normal dimensions), and I found the cuboid files are 1.9 TB; the step of converting cuboids to HFiles is too slow. I checked the log of this job and found there are 9000+ mappers and only one reducer.

I discarded the job when our Hadoop administrator told me the node running that reducer was out of disk space. I had to stop it. I wonder why there is only one reducer (I have not checked the source code of this job). By the way, my original data is only hundreds of MB. I think this would cause more problems if the original data were bigger or there were many more dimensions.
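To make Yerui Sun's explanation above concrete: HBase's bulk-load helper creates one reducer per region of the target table, so a single-region HTable means a single reducer for all the cuboid data. The sketch below shows where that coupling happens using the plain HBase 0.98-era MapReduce API; it is not Kylin's actual job class, and the table and path names are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CuboidToHFileSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "convert cuboid to HFile (sketch)");
            job.setJarByClass(CuboidToHFileSketch.class);

            // A real job would also set an input format over the cuboid files and a
            // mapper that emits (ImmutableBytesWritable, KeyValue) pairs.
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            FileOutputFormat.setOutputPath(job, new Path("/kylin/example/hfile")); // placeholder

            // The key call: configureIncrementalLoad() reads the region start keys of
            // the target HTable, configures a TotalOrderPartitioner over them, and sets
            // the number of reducers to the number of regions. A table created with a
            // single region (e.g. because rowkey_stats was missing, see KYLIN-968)
            // therefore funnels all the cuboid data through exactly one reducer.
            HTable table = new HTable(conf, "KYLIN_EXAMPLE_HTABLE");               // placeholder
            HFileOutputFormat2.configureIncrementalLoad(job, table);

            System.out.println("reducers = regions = " + job.getNumReduceTasks());
            table.close();
        }
    }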
2015-09-10 23:46 GMT+08:00 Luke Han <[email protected]>:

2.0 will not come out soon; there is a huge refactoring and a bunch of new features, and we have to make sure there are no critical bugs before the release.

The same function is also available under the v1.x branch; please stay tuned for updates on that.

Thanks.
Luke Han


On Thu, Sep 10, 2015 at 7:50 PM, yu feng <[email protected]> wrote:

What good news! I hope you can release that version as quickly as possible. Today I built a cube whose cuboid files are 1.9 TB; if we merge cubes based on cuboid files, I think it will be very slow.


2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:

We have implemented merging from the HTable directly in Kylin 2.0, which hasn't been released/announced yet.


On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:

I think Kylin could finish merging based only on the tables in HBase. This would make merging cubes much quicker, wouldn't it?


2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:

After checking the source code, I find you are right: the cuboid files are used when merging segments. But a new question comes up: why doesn't Kylin merge segments based on the HFiles/HTables alone? I cannot find any place that takes an HBase table as the input format of a MapReduce job, although Kylin uses HFileOutputFormat as the output format when converting cuboids to HFiles.

From this I see that Kylin actually takes more space per cube: not only the HFiles but also the cuboid files. The former are used for query and the latter for merge, and the cuboid files are bigger than the HFiles.

I think we could do something to optimize this... I would like to know your opinions about it.
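On the question above about using an HBase table as the input of a MapReduce job: the standard HBase API does provide this through TableInputFormat, wired up via TableMapReduceUtil, which produces one input split per region. The sketch below shows that generic API only; it is not Kylin 2.0's actual merge-from-HTable job, and the table name, scan settings and job wiring are placeholders.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ReadHTableSketch {

        // A TableMapper receives one row (key, Result) per call; a merge job could
        // re-encode rows here instead of reading cuboid files from HDFS.
        static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable key, Result value, Context context)
                    throws IOException, InterruptedException {
                context.getCounter("sketch", "rows").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "scan segment HTable (sketch)");
            job.setJarByClass(ReadHTableSketch.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // larger scanner caching for full-table MR scans
            scan.setCacheBlocks(false);  // don't pollute the region server block cache

            // Wires TableInputFormat up as the job's input: one split per region.
            TableMapReduceUtil.initTableMapperJob(
                    "KYLIN_EXAMPLE_HTABLE",   // placeholder segment table name
                    scan, RowCountMapper.class,
                    ImmutableBytesWritable.class, Result.class, job);

            job.setNumReduceTasks(0);                        // map-only for this sketch
            job.setOutputFormatClass(NullOutputFormat.class);
            job.waitForCompletion(true);
        }
    }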
2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:

Hi, yu feng,
I've also noticed these files and opened a JIRA: https://issues.apache.org/jira/browse/KYLIN-978; I'll post a patch tonight.

Here are my opinions on your three questions, feel free to correct me:

First, the data path of the intermediate hive table should be deleted after building; I agree with that.

Second, the cuboid files are used for merge and are deleted when the merge job completes, so we need to, and must, leave them on HDFS. The fact_distinct_columns path should be deleted. In addition, the rowkey_stats and hfile paths should also be deleted.

Third, there are no garbage collection steps if a job is discarded; maybe we need a patch for this.

Short answer:
KYLIN-978 will clean all HDFS paths except the cuboid files after a build job or merge job completes.
The HDFS paths will not be cleaned up if a job is discarded; we need an improvement for this.

Best Regards,
Yerui Sun
[email protected]


On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:

I see this core improvement in release 1.0, JIRA URL: https://issues.apache.org/jira/browse/KYLIN-926

However, after testing and checking the source code, I find some rubbish (I am not sure) files in HDFS.

First, Kylin only drops the intermediate table in Hive, but the table is an EXTERNAL table, so its files still exist in the Kylin tmp directory in HDFS (I checked that).

Second, the cuboid files take a large amount of space in HDFS, and Kylin does not delete them after the cube build (the fact_distinct_columns files exist too). I am not sure whether they have other effects; please remind me if they do.

Third, after I discard a job, I think Kylin should delete the intermediate files and drop the intermediate hive table, even if it deletes them asynchronously. I think this data does not have any other effect; please remind me if it does.

This rubbish data still exists in the current version (kylin-1.0); please check, thanks.
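As a concrete illustration of the cleanup discussed in this thread (not the actual KYLIN-978 patch), a post-job step could delete the per-job HDFS paths that are safe to drop, the intermediate hive table data, fact_distinct_columns, rowkey_stats and the temporary hfile directory, while keeping the cuboid files that a later merge still needs. The directory names and the working-directory layout below are assumptions based only on the names mentioned in this thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JobOutputCleanup {

        // Directories under the per-job working dir that are only needed while the
        // job runs; "cuboid" is deliberately NOT listed because merge jobs read it.
        private static final String[] DISPOSABLE_DIRS = {
                "kylin_intermediate_table",   // data of the EXTERNAL intermediate hive table
                "fact_distinct_columns",
                "rowkey_stats",
                "hfile"                       // temporary HFiles, already bulk-loaded into HBase
        };

        /** Deletes disposable outputs of one build, merge, or discarded job, e.g.
         *  jobWorkingDir = "hdfs:///kylin/.../kylin-<jobId>/<cubeName>" (assumed layout). */
        public static void cleanup(String jobWorkingDir) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            for (String dir : DISPOSABLE_DIRS) {
                Path path = new Path(jobWorkingDir, dir);
                if (fs.exists(path)) {
                    fs.delete(path, true);    // recursive delete
                    System.out.println("Deleted " + path);
                } else {
                    System.out.println("Skipped missing " + path);
                }
            }
        }
    }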
