Re: rubbish files exist in HDFS

Shi, Shaofeng Thu, 10 Sep 2015 04:35:06 -0700

We have implemented merging directly from the HTable in Kylin 2.0, which
hasn’t been released or announced yet.


On 9/10/15, 7:22 PM, "yu feng" <[email protected]> wrote:

>I think Kylin could complete a merge using only the tables in HBase. That
>would make merging cubes faster, wouldn't it?
>
>2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:
>
>> After checking the source code, I find you are right: the cuboid files are
>> used when merging segments. But this raises a new question: why does Kylin
>> merge segments based only on hfile? I cannot find where an HBase table is
>> used as the input format of a MapReduce job, yet Kylin uses
>> HFileOutputFormat as the output format when converting cuboids to hfile.
>>
>> From this, I see that a cube actually takes more space in Kylin: not only
>> the hfile but also the cuboid files. The former is used for queries and
>> the latter for merges, and the cuboid files are larger than the hfiles.
>>
>> I think we could do something to optimize this. I would like to know your
>> opinion on it.
>>
>> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
>>
>>> Hi, yu feng,
>>>   I’ve also noticed these files and opened a jira:
>>> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
>>> tonight.
>>>
>>>   Here are my opinions on your three questions; feel free to correct me:
>>>
>>>   First, the data path of the intermediate Hive table should be deleted
>>> after building. I agree with that.
>>>
>>>   Second, the cuboid files will be used for merging and will be deleted
>>> when the merge job completes, so we must leave them on HDFS. The
>>> fact_distinct_columns files should be deleted. In addition, the paths of
>>> rowkey_stats and hfile should also be deleted.
>>>
>>>   Third, there are no garbage collection steps when a job is discarded;
>>> maybe we need a patch for this.
>>>
>>>
>>> Short answer:
>>>   KYLIN-978 will clean up all HDFS paths except the cuboid files after a
>>> buildJob or mergeJob completes.
>>>   The HDFS paths will not be cleaned up if a job is discarded; we need an
>>> improvement for this.
>>>
>>>
>>> Best Regards,
>>> Yerui Sun
>>> [email protected]
>>>
>>>
>>>
>>> > On Sep 10, 2015, at 18:20, yu feng <[email protected]> wrote:
>>> >
>>> > I saw this core improvement in release 1.0. JIRA URL:
>>> > https://issues.apache.org/jira/browse/KYLIN-926
>>> >
>>> > However, after testing and checking the source code, I found some
>>> > rubbish (I am not sure) files in HDFS.
>>> >
>>> > First, Kylin only drops the intermediate table in Hive, but the table
>>> > is an EXTERNAL table, so the files still exist in the Kylin tmp
>>> > directory in HDFS (I checked that).
>>> >
>>> > Second, the cuboid files take up a lot of space in HDFS, and Kylin does
>>> > not delete them after the cube build (the fact_distinct_columns files
>>> > remain too). I am not sure whether they serve another purpose; please
>>> > tell me if they do.
>>> >
>>> > Third, after I discard a job, I think Kylin should delete the
>>> > intermediate files and drop the intermediate Hive table, even if it
>>> > deletes them asynchronously. I think that data has no further use;
>>> > please tell me if it does.
>>> >
>>> > This rubbish data still exists in the current version (kylin-1.0);
>>> > please check. Thanks.
>>>
>>>
>>
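
The cleanup policy discussed in this thread (keep the cuboid files after a
build or merge job because a later merge reads them, remove the other
intermediate output, and remove everything when a job is discarded) can be
sketched roughly as below. This is only an illustration under stated
assumptions: the labels in ALL_KINDS, the JobOutcome enum, and the
kinds_to_delete function are hypothetical names, not Kylin's actual code or
HDFS layout, and cleanup on discard was a proposal in the thread, not
existing behavior.

```python
from enum import Enum


class JobOutcome(Enum):
    BUILD_DONE = "build_done"
    MERGE_DONE = "merge_done"
    DISCARDED = "discarded"


# Kinds of intermediate HDFS output mentioned in the thread.
# These labels are hypothetical, not Kylin's real directory names.
ALL_KINDS = frozenset({
    "hive_intermediate_table",
    "cuboid",
    "fact_distinct_columns",
    "rowkey_stats",
    "hfile",
})


def kinds_to_delete(outcome):
    """Return the kinds of intermediate output that may be removed."""
    if outcome in (JobOutcome.BUILD_DONE, JobOutcome.MERGE_DONE):
        # Cuboid files are kept because a later merge reads them.
        return ALL_KINDS - {"cuboid"}
    if outcome is JobOutcome.DISCARDED:
        # Proposed in the thread: a discarded job's output is not needed.
        return ALL_KINDS
    raise ValueError(f"unknown outcome: {outcome!r}")


if __name__ == "__main__":
    for outcome in JobOutcome:
        print(outcome.name, sorted(kinds_to_delete(outcome)))
```

Keeping the decision separate from the actual file deletion makes the
policy easy to check without touching HDFS.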
