Re: rubbish files exist in HDFS

yu feng Thu, 10 Sep 2015 04:23:42 -0700

I think Kylin could finish merging by depending only on the tables in HBase.
That would make merging cubes quicker, wouldn't it?


2015-09-10 19:16 GMT+08:00 yu feng <[email protected]>:

> After checking the source code, I find you are right: the cuboid files are
> used while merging segments. But a new question comes up: why doesn't Kylin
> merge segments based only on the HFiles? I cannot find how to use an HBase
> table as the input format of a MapReduce job, yet Kylin uses
> HFileOutputFormat as the output format when converting cuboids to HFiles.
>
> From this, I find that Kylin actually takes more space for a cube: not
> only the HFiles but also the cuboid files. The former are used for queries
> and the latter for merging, and the cuboid files are larger than the
> HFiles.
>
> I think we could do something to optimize this... I would like to know
> your opinions about it.
>
> 2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:
>
>> Hi, yu feng,
>>   I’ve also noticed these files and opened a JIRA issue:
>> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
>> tonight.
>>
>>   Here are my opinions on your three questions; feel free to correct me:
>>
>>   First, the data path of the intermediate Hive table should be deleted
>> after building; I agree with that.
>>
>>   Second, the cuboid files will be used for merging and will be deleted
>> when the merge job completes, so we must leave them on HDFS. The
>> fact_distinct_columns files should be deleted. In addition, the paths of
>> rowkey_stats and hfile should also be deleted.
>>
>>   Third, there are no garbage collection steps when a job is discarded;
>> maybe we need a patch for this.
>>
>>
>> Short answer:
>>   KYLIN-978 will clean up all HDFS paths except the cuboid files after
>> buildJob and mergeJob complete.
>>   The HDFS paths will not be cleaned up if a job is discarded; we need an
>> improvement for this.
>>
>>
>> Best Regards,
>> Yerui Sun
>> [email protected]
>>
>>
>>
>> > On 10 Sep 2015, at 18:20, yu feng <[email protected]> wrote:
>> >
>> > I see this core improvement in release 1.0, JIRA URL:
>> > https://issues.apache.org/jira/browse/KYLIN-926
>> >
>> > However, after testing and checking the source code, I found some
>> > rubbish (I am not sure) files in HDFS.
>> >
>> > First, Kylin only drops the intermediate table in Hive, but the table
>> > is an EXTERNAL table, so its files still exist in the Kylin tmp
>> > directory in HDFS (I checked that).
>> >
>> > Second, the cuboid files take a large amount of space in HDFS, and
>> > Kylin does not delete them after the cube build (the
>> > fact_distinct_columns files remain too). I am not sure whether these
>> > files serve another purpose; please let me know if they do.
>> >
>> > Third, after I discard a job, I think Kylin should delete the
>> > intermediate files and drop the intermediate Hive table, even if it
>> > deletes them asynchronously. I think that data serves no further
>> > purpose; please let me know if it does.
>> >
>> > This rubbish data still exists in the current version (kylin-1.0);
>> > please check, thanks.
>>
>>
>
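
For reference, the cleanup policy discussed in this thread can be sketched
in a few lines. This is only an illustration in Python, not Kylin's code and
not the KYLIN-978 patch. The kind labels, the Outcome enum and the function
name are hypothetical. Deleting the merged segments' cuboid files once a
merge completes is one reading of the reply above, and deleting everything
for a discarded job is the proposal from the original mail, not current
behaviour.

```python
from enum import Enum


class Outcome(Enum):
    """How a job ended (hypothetical names, for illustration only)."""
    BUILD_DONE = 1
    MERGE_DONE = 2
    DISCARDED = 3


# Hypothetical labels for the intermediate outputs named in the thread.
# Cuboid files are deliberately absent: a later merge still reads them.
ALWAYS_TEMPORARY = {
    "hive_intermediate_table",
    "fact_distinct_columns",
    "rowkey_stats",
    "hfile",
}


def paths_to_delete(outcome, job_paths, source_cuboid_paths=()):
    """Return the sorted HDFS paths that could be removed once a job ends.

    job_paths maps an output kind to the path this job wrote.
    source_cuboid_paths lists the cuboid paths of the segments that a
    merge job consumed; it is ignored for other outcomes.
    """
    if outcome is Outcome.DISCARDED:
        # Proposed behaviour: nothing written by a discarded job is needed.
        doomed = set(job_paths.values())
    else:
        # Keep this job's own cuboid files; drop the other outputs.
        doomed = {
            path for kind, path in job_paths.items()
            if kind in ALWAYS_TEMPORARY
        }
    if outcome is Outcome.MERGE_DONE:
        # The merged segments' cuboid files have served their purpose.
        doomed.update(source_cuboid_paths)
    return sorted(doomed)
```

The function only decides which paths are candidates; actually removing
them (and doing so asynchronously, as suggested above) is left out.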
