Hi yu feng,

Let me guess the reason for your problem: the number of reducers of the cuboid-to-HFile job depends on the number of regions of the corresponding HTable.
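(For context: HBase's HFileOutputFormat#configureIncrementalLoad sets the reducer count to the table's region count, because each reducer must produce HFiles covering exactly one region's key range. Below is a minimal sketch of that relationship only, not Kylin's actual code; the region start keys are supplied directly instead of coming from a live HTable#getStartKeys() call:

```java
import java.util.Arrays;

// Sketch only (not Kylin's code): why the cuboid-to-HFile job gets exactly
// one reducer per region of the target HTable, and how a rowkey is routed
// to "its" reducer.
public class RegionReducerSketch {

    // On a real cluster these would come from HTable#getStartKeys().
    final byte[][] regionStartKeys;

    RegionReducerSketch(byte[][] regionStartKeys) {
        this.regionStartKeys = regionStartKeys;
    }

    // HFileOutputFormat.configureIncrementalLoad effectively does
    // job.setNumReduceTasks(table.getStartKeys().length).
    int numReducers() {
        return regionStartKeys.length;
    }

    // TotalOrderPartitioner-style routing (simplified to a linear scan):
    // pick the reducer whose region's key range contains the rowkey.
    int reducerFor(byte[] rowkey) {
        int idx = 0;
        for (int i = 0; i < regionStartKeys.length; i++) {
            if (Arrays.compareUnsigned(rowkey, regionStartKeys[i]) >= 0) {
                idx = i;
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        // A table created with no split keys has a single region, hence
        // a single reducer no matter how many mappers the job has.
        RegionReducerSketch oneRegion = new RegionReducerSketch(new byte[][] { {} });
        System.out.println("reducers = " + oneRegion.numReducers()); // prints: reducers = 1
    }
}
```

This is why a 9000-mapper job can funnel into one reducer: the reducer count is fixed by the HTable's split keys, not by input size.)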
For now, all HTables were created with only one region, caused by the wrong path of rowkey_stats. I've opened a JIRA for this issue: https://issues.apache.org/jira/browse/KYLIN-968. The patch has been available since last night.

Here are some clues to confirm my guess:
1. Find the corresponding HTable name in the log and check its regions; it should have only one region.
2. Check your Kylin working directory on HDFS; there should be a path like '../kylin-null/../rowkey_stats'.
3. Grep kylin.log in the Tomcat directory; you should find a log line containing 'no region split, HTable will be one region'.

If you hit all three clues, I think KYLIN-968 should resolve your problem.

> On Sep 11, 2015, at 00:54, yu feng <[email protected]> wrote:
>
> OK, I found another problem (I am a problem maker, ^_^). Today I built this
> cube, which has 15 dimensions (one mandatory dimension, two hierarchy
> dimensions, and the rest normal dimensions). I found the cuboid files are
> 1.9 TB, and the step of converting cuboids to HFiles is too slow. I checked
> the log of this job and found there are 9000+ mappers but only one reducer.
>
> I discarded this job when our Hadoop administrator told me the node which
> ran this reducer was out of disk space. I had to stop it. I wonder why
> there is only one reducer (I did not check the source code of this job).
> By the way, my original data is only hundreds of MB. I think this would
> cause more problems if the original data were bigger or there were many
> more dimensions.
>
> 2015-09-10 23:46 GMT+08:00 Luke Han <[email protected]>:
>
>> 2.0 will not come out soon; there is a huge refactor and a bunch of new
>> features, and we have to make sure there are no critical bugs before
>> release.
>>
>> The same function is also available on the v1.x branch; please stay tuned
>> for updates on that.
>>
>> Thanks.
>>
>> Best Regards!
>> ---------------------
>>
>> Luke Han
>>
>> On Thu, Sep 10, 2015 at 7:50 PM, yu feng <[email protected]> wrote:
>>
>>> What good news!
>>> I wish you could release that version as soon as possible. Today I built
>>> a cube whose cuboid files are 1.9 TB; if we merge cubes based on cuboid
>>> files, I think it will be very slow.
>>>
>>> 2015-09-10 19:34 GMT+08:00 Shi, Shaofeng <[email protected]>:
>>>
>>>> We have implemented merging from the HTable directly in Kylin 2.0,
>>>> which hasn't been released/announced yet.
>>>>
>>>> On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:
>>>>
>>>>> I think Kylin could finish merging based only on the tables in HBase;
>>>>> that would make merging cubes much quicker, wouldn't it?
>>>>>
>>>>> 2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
>>>>>
>>>>>> After checking the source code, I find you are right: cuboid files
>>>>>> will be used while merging segments. But a new question comes up: why
>>>>>> doesn't Kylin merge segments based on the HFiles directly? I cannot
>>>>>> find anywhere that an HBase table is taken as the input format of a
>>>>>> MapReduce job; Kylin only takes HFileOutputFormat as the output
>>>>>> format while converting cuboids to HFiles.
>>>>>>
>>>>>> From this, I find Kylin actually takes more space for a cube: not
>>>>>> only the HFiles but also the cuboid files. The former are used for
>>>>>> queries and the latter for merging, and the cuboid files are bigger
>>>>>> than the HFiles.
>>>>>>
>>>>>> I think we could do something to optimize this; I want to know your
>>>>>> opinions about it.
>>>>>>
>>>>>> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
>>>>>>
>>>>>>> Hi yu feng,
>>>>>>> I've also noticed these files and opened a JIRA:
>>>>>>> https://issues.apache.org/jira/browse/KYLIN-978; I'll post a patch
>>>>>>> tonight.
>>>>>>>
>>>>>>> Here are my opinions on your three questions; feel free to correct
>>>>>>> me:
>>>>>>>
>>>>>>> First, the data path of the intermediate Hive table should be
>>>>>>> deleted after building; I agree with that.
>>>>>>>
>>>>>>> Second, the cuboid files will be used for merging and will be
>>>>>>> deleted when the merge job completes, so we need to (and must) leave
>>>>>>> them on HDFS. The fact_distinct_columns path should be deleted.
>>>>>>> Additionally, the rowkey_stats and hfile paths should also be
>>>>>>> deleted.
>>>>>>>
>>>>>>> Third, there is no garbage collection step if a job is discarded;
>>>>>>> maybe we need a patch for this.
>>>>>>>
>>>>>>> Short answer:
>>>>>>> KYLIN-978 will clean all HDFS paths except the cuboid files after
>>>>>>> the build job and merge job complete.
>>>>>>> The HDFS paths will not be cleaned up if a job was discarded; we
>>>>>>> need an improvement for this.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yerui Sun
>>>>>>> [email protected]
>>>>>>>
>>>>>>>> On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I saw this core improvement in release 1.0; JIRA URL:
>>>>>>>> https://issues.apache.org/jira/browse/KYLIN-926
>>>>>>>>
>>>>>>>> However, after my test and a check of the source code, I found some
>>>>>>>> garbage (I am not sure) files in HDFS.
>>>>>>>>
>>>>>>>> First, Kylin only drops the intermediate table in Hive, but the
>>>>>>>> table is an EXTERNAL table, so the files still exist in Kylin's tmp
>>>>>>>> directory in HDFS (I checked that).
>>>>>>>>
>>>>>>>> Second, the cuboid files take a large amount of space in HDFS, and
>>>>>>>> Kylin does not delete them after the cube build
>>>>>>>> (fact_distinct_columns files exist too). I am not sure whether they
>>>>>>>> have other effects; please remind me if they do.
>>>>>>>>
>>>>>>>> Third, after I discard a job, I think Kylin should delete the
>>>>>>>> intermediate files and drop the intermediate Hive table, even if it
>>>>>>>> deletes them asynchronously. I think that data does not have any
>>>>>>>> effect; please remind me if it does.
>>>>>>>>
>>>>>>>> These garbage files still exist in the current version (kylin-1.0);
>>>>>>>> please check. Thanks.
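(For readers following the KYLIN-978 discussion above: the cleanup described amounts to deleting a job's temporary output directories while keeping the cuboid files, which a later merge job still reads. Here is a hedged sketch of that idea, using java.nio.file on the local filesystem in place of the HDFS API; the subdirectory names are assumptions patterned on the paths mentioned in this thread, not Kylin's actual layout:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;

// Sketch only, not the KYLIN-978 patch: after a build/merge job completes,
// delete the disposable job outputs but keep the cuboid files, which are
// still needed by later merge jobs.
public class JobCleanupSketch {

    // Paths assumed safe to remove once the job succeeds (names are
    // illustrative, modeled on the thread above).
    static final List<String> DISPOSABLE = List.of(
            "fact_distinct_columns", "rowkey_stats", "hfile");

    static void cleanup(Path jobWorkingDir) throws IOException {
        for (String name : DISPOSABLE) {
            Path p = jobWorkingDir.resolve(name);
            if (Files.exists(p)) {
                // Delete the directory tree bottom-up (children first).
                try (var walk = Files.walk(p)) {
                    for (Path f : walk.sorted(Comparator.reverseOrder()).toList()) {
                        Files.delete(f);
                    }
                }
            }
        }
        // Note: "cuboid" is deliberately NOT in the list; merge reads it later.
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("kylin-job-demo");
        Files.createDirectories(dir.resolve("rowkey_stats"));
        Files.createDirectories(dir.resolve("cuboid"));
        cleanup(dir);
        System.out.println(Files.exists(dir.resolve("cuboid"))); // prints: true
    }
}
```

On a real cluster the same shape would use Hadoop's FileSystem#delete(path, true) against the Kylin working directory; the missing piece Yerui notes, a cleanup hook for discarded jobs, would call the same routine from the discard action.)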
