Re: rubbish files exist in HDFS

ShaoFeng Shi Thu, 10 Sep 2015 20:18:06 -0700

If "rowkey_stats" isn't found, Kylin should throw an exception and exit
instead of silently using one region. I'm going to change this; please let me
know if you disagree.
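
A minimal sketch of the fail-fast behaviour proposed above, not Kylin's
actual code. It assumes, for illustration only, that the split keys sit in a
local text file with one key per line; in Kylin the rowkey_stats path is on
HDFS. The names load_split_keys and MissingRowkeyStatsError are hypothetical.

```python
from pathlib import Path


class MissingRowkeyStatsError(RuntimeError):
    """Raised when the rowkey_stats file needed for region splits is absent."""


def load_split_keys(stats_path):
    """Return the split keys in stats_path, one per non-empty line.

    Hypothetical helper: a missing file raises an error instead of
    silently falling back to a single region.
    """
    path = Path(stats_path)
    if not path.is_file():
        raise MissingRowkeyStatsError(
            f"rowkey_stats not found at {path}; "
            "refusing to create a one-region HTable"
        )
    # Assumption: one split key per line, blank lines ignored.
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]
```

The point of the design is that a wrong path surfaces as a failed job step
rather than as a slow single-reducer job much later.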


2015-09-11 10:17 GMT+08:00 Yerui Sun <[email protected]>:

> Hi, yu feng,
>   Let me guess at the cause of your problem.
>
>   The number of reducers in the convert-to-HFile job depends on the number
> of regions in the corresponding HTable.
>
>   For now, every HTable is created with only one region because of a
> wrong rowkey_stats path. I’ve opened a JIRA for this issue:
> https://issues.apache.org/jira/browse/KYLIN-968. The patch has been
> available since last night.
>
>   Here are some clues to confirm my guess:
>   1. Find the corresponding HTable name in the log and check its regions;
> it should have only one region.
>   2. Check your Kylin working directory on HDFS; there should be a path
> like ‘../kylin-null/../rowkey_stats’.
>   3. Grep kylin.log in the Tomcat dir; it should contain
> ‘no region split, HTable will be one region’.
>
>   If all three clues match, I think KYLIN-968 should resolve your
> problem.
>
>
> > On Sep 11, 2015, at 00:54, yu feng <[email protected]> wrote:
> >
> > OK, I found another problem (I am a problem maker, ^_^). Today I built a
> > cube with 15 dimensions (one mandatory dimension, two hierarchy
> > dimensions, and the rest normal dimensions). The cuboid files are 1.9TB,
> > and the step that converts cuboids to HFiles is too slow. I checked the
> > log of this job and found 9000+ mappers and only one reducer.
> >
> > I discarded the job when our Hadoop administrator told me that the node
> > running this reducer was out of disk space, so I had to stop it. I
> > wonder why there is only one reducer (I have not checked the source code
> > of this job). By the way, my original data is only a few hundred MB. I
> > think this would cause more problems if the original data were bigger or
> > there were more dimensions.
> >
> > 2015-09-10 23:46 GMT+08:00 Luke Han <[email protected]>:
> >
> >> 2.0 will not be released soon; it involves a huge refactor and a bunch
> >> of new features, and we have to make sure there are no critical bugs
> >> before release.
> >>
> >> The same function will also be available in the v1.x branch; please stay
> >> tuned for updates on that.
> >>
> >> Thanks.
> >>
> >>
> >> Best Regards!
> >> ---------------------
> >>
> >> Luke Han
> >>
> >> On Thu, Sep 10, 2015 at 7:50 PM, yu feng <[email protected]> wrote:
> >>
> >>> What good news! I hope you can release that version as soon as
> >>> possible. Today I built a cube whose cuboid files are 1.9TB. If we
> >>> merge cubes based on cuboid files, I think it will be very slow.
> >>>
> >>> 2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:
> >>>
> >>>> We have implemented the merge from HTable directly in Kylin 2.0, which
> >>>> hasn’t been released/announced.
> >>>>
> >>>> On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
> >>>>
> >>>>> I think Kylin could do the merge using only the tables in HBase.
> >>>>> That would make merging cubes quicker, wouldn't it?
> >>>>>
> >>>>> 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
> >>>>>
> >>>>>> After checking the source code, I find you are right: cuboid files
> >>>>>> are used while merging segments. But that raises a new question:
> >>>>>> why doesn't Kylin merge segments based on HFiles alone? I cannot
> >>>>>> find a way to take an HBase table as the input format of a
> >>>>>> MapReduce job, yet Kylin uses HFileOutputFormat as the output
> >>>>>> format when converting cuboids to HFiles.
> >>>>>>
> >>>>>> From this, I find that Kylin actually takes more space for a cube:
> >>>>>> not only HFiles but also cuboid files. The former are used for
> >>>>>> queries and the latter for merges, and the cuboid files are larger
> >>>>>> than the HFiles.
> >>>>>>
> >>>>>> I think we could do something to optimize this. I would like to
> >>>>>> know your opinions about it.
> >>>>>>
> >>>>>> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
> >>>>>>
> >>>>>>> Hi, yu feng,
> >>>>>>>  I’ve also noticed these files and opened a JIRA:
> >>>>>>> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a
> >>>>>>> patch tonight.
> >>>>>>>
> >>>>>>>  Here are my opinions on your three questions; feel free to
> >>>>>>> correct me:
> >>>>>>>
> >>>>>>>  First, the data path of the intermediate Hive table should be
> >>>>>>> deleted after building; I agree with that.
> >>>>>>>
> >>>>>>>  Second, the cuboid files are used for merging and are deleted
> >>>>>>> when the merge job completes, so we must leave them on HDFS. The
> >>>>>>> fact_distinct_columns path should be deleted. In addition, the
> >>>>>>> rowkey_stats and hfile paths should also be deleted.
> >>>>>>>
> >>>>>>>  Third, there are no garbage-collection steps when a job is
> >>>>>>> discarded; maybe we need a patch for this.
> >>>>>>>
> >>>>>>>
> >>>>>>> Short answer:
> >>>>>>>  KYLIN-978 will clean up all HDFS paths except the cuboid files
> >>>>>>> after buildJob and mergeJob complete.
> >>>>>>>  HDFS paths are not cleaned up if a job is discarded; we need to
> >>>>>>> improve this.
> >>>>>>>
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Yerui Sun
> >>>>>>> [email protected]
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> I saw this core improvement in release 1.0, JIRA URL:
> >>>>>>>> https://issues.apache.org/jira/browse/KYLIN-926
> >>>>>>>>
> >>>>>>>> However, after testing and checking the source code, I found
> >>>>>>>> some rubbish (I am not sure) files in HDFS.
> >>>>>>>>
> >>>>>>>> First, Kylin only drops the intermediate table in Hive, but the
> >>>>>>>> table is an EXTERNAL table, so the files still exist in the Kylin
> >>>>>>>> tmp directory in HDFS (I checked).
> >>>>>>>>
> >>>>>>>> Second, the cuboid files take up a lot of space in HDFS, and
> >>>>>>>> Kylin does not delete them after the cube build (the
> >>>>>>>> fact_distinct_columns files remain too). I am not sure whether
> >>>>>>>> they serve another purpose; please let me know if they do.
> >>>>>>>>
> >>>>>>>> Third, after I discard a job, I think Kylin should delete the
> >>>>>>>> intermediate files and drop the intermediate Hive table, even if
> >>>>>>>> it deletes them asynchronously. I think that data serves no
> >>>>>>>> purpose; please let me know if it does.
> >>>>>>>>
> >>>>>>>> This rubbish data still exists in the current version
> >>>>>>>> (kylin-1.0); please check. Thanks.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
>
>
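
The third check in Yerui Sun's quoted reply (grepping kylin.log for the
one-region message) can be sketched in Python as follows. The marker string
is the one quoted in the thread; the function name find_one_region_warnings
is made up for illustration.

```python
ONE_REGION_MARKER = "no region split, HTable will be one region"


def find_one_region_warnings(log_lines, marker=ONE_REGION_MARKER):
    """Return (line_number, text) pairs for log lines containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]
```

It accepts any iterable of lines, so an open file object for kylin.log can
be passed directly; an empty result means the message never appeared.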
