After checking the source code, I see you are right: the cuboid files are
used when merging segments. But that raises a new question: why doesn't
Kylin merge segments based on the HFiles alone? I cannot find any place
where an HBase table is used as the input format of a MapReduce job,
although Kylin does use HFileOutputFormat as the output format when
converting cuboids to HFiles.

From this, I see that Kylin actually takes more space for a cube: not only
the HFiles but also the cuboid files. The former are used for queries and
the latter for merges, and the cuboid files are larger than the HFiles.

I think we could do something to optimize this. I would like to hear your
opinions on it.
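
To make the storage concern concrete, here is a rough sketch of which job
outputs stay on HDFS under the rules discussed in this thread. It is only a
model of the policy as I understand it, not Kylin code: the function name,
the state strings, and the output labels are all hypothetical.

```python
# Hypothetical model of the cleanup rules discussed in this thread.
# None of these names come from Kylin's code base.

JOB_OUTPUTS = {
    "intermediate_hive_table_data",
    "fact_distinct_columns",
    "rowkey_stats",
    "hfile",
    "cuboid",
}


def outputs_left_on_hdfs(job_state):
    """Return the kinds of output that remain on HDFS for a job.

    job_state is "finished" (a build or merge job that completed) or
    "discarded".
    """
    if job_state == "finished":
        # With KYLIN-978, everything except the cuboid files is cleaned,
        # because the cuboid files are still needed for a later merge.
        return {"cuboid"}
    if job_state == "discarded":
        # No garbage collection step runs for a discarded job, so whatever
        # the job had already written can remain (worst case: all of it).
        return set(JOB_OUTPUTS)
    raise ValueError("unknown job state: %r" % (job_state,))


if __name__ == "__main__":
    for state in ("finished", "discarded"):
        print(state, sorted(outputs_left_on_hdfs(state)))
```

The point of the sketch is that the cuboid files are the one output that
survives a successful job, which is why their size matters.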

2015-09-10 18:36 GMT+08:00 Yerui Sun <[email protected]>:

> Hi, yu feng,
>   I’ve also noticed these files and opened a jira:
> https://issues.apache.org/jira/browse/KYLIN-978, and I’ll post a patch
> tonight.
>
>   Here are my opinions on your three questions; feel free to correct me:
>
>   First, the data path of the intermediate Hive table should be deleted
> after building; I agree with that.
>
>   Second, the cuboid files will be used for merging and will be deleted
> when the merge job completes, so we must leave them on HDFS. The
> fact_distinct_columns files should be deleted. In addition, the paths of
> rowkey_stats and hfile should also be deleted.
>
>   Third, there are no garbage collection steps when a job is discarded;
> maybe we need a patch for this.
>
>
> Short answer:
>   KYLIN-978 will clean up all HDFS paths except the cuboid files after a
> buildJob or mergeJob completes.
>   The HDFS paths will not be cleaned up if a job is discarded; we need an
> improvement for this.
>
>
> Best Regards,
> Yerui Sun
> [email protected]
>
>
>
> > On September 10, 2015, at 18:20, yu feng <[email protected]> wrote:
> >
> > I see this core improvement in release 1.0; JIRA URL:
> > https://issues.apache.org/jira/browse/KYLIN-926
> >
> > However, after testing and checking the source code, I found some files
> > in HDFS that look like garbage (I am not sure).
> >
> > First, Kylin only drops the intermediate table in Hive, but the table is
> > an EXTERNAL table, so its files still exist in the Kylin tmp directory
> > in HDFS (I checked that).
> >
> > Second, the cuboid files take a lot of space in HDFS, and Kylin does not
> > delete them after the cube build (the fact_distinct_columns files remain
> > too). I am not sure whether they have other uses; please let me know if
> > they do.
> >
> > Third, after I discard a job, I think Kylin should delete the
> > intermediate files and drop the intermediate Hive table, even if it
> > deletes them asynchronously. I think that data has no further use;
> > please let me know if it does.
> >
> > This garbage data still exists in the current version (kylin-1.0);
> > please check, thanks.
>
>
