[jira] [Updated] (KYLIN-702) When Kylin create the flat hive table, it generates large number of small files in HDFS

Shaofeng SHI (JIRA) Fri, 17 Apr 2015 07:29:42 -0700

     [ https://issues.apache.org/jira/browse/KYLIN-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shaofeng SHI updated KYLIN-702:
-------------------------------
    Description: 
When building a cube, I noticed that the dictionary-building and 
cube-calculation steps started a very large number of mappers (more than 
10,000). The logs showed that many of those mappers had zero or very few 
records to process, which confused me. 

I then checked the storage location of the flat table and found that it 
contained many files. I counted them, and the number of files matched the 
number of mappers. 

Too many mappers cause significant overhead and degrade the cluster's 
performance. Kylin should ask Hive to merge those small files during the step 
that creates the flat table. 
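
To illustrate why the file count drives the mapper count, here is a minimal 
Python sketch. It is not Kylin or Hadoop code. It assumes a non-combining 
input format in which every file, even an empty one, yields at least one 
input split. The 128 MiB split size and the 100 KiB file size are hypothetical 
values chosen for illustration; only the "more than 10,000" and "4 files of up 
to 256M" figures come from the observations in this issue.

```python
def estimate_mappers(file_sizes, split_size=128 * 1024 * 1024):
    """Rough estimate of map tasks: one split per split_size chunk,
    and at least one split per file (including empty files)."""
    total = 0
    for size in file_sizes:
        # Ceiling division in integer arithmetic; never fewer than 1 split.
        total += max(1, -(-size // split_size))
    return total


# Hypothetical flat table: 10,000 small files of 100 KiB each.
small_files = [100 * 1024] * 10_000
# After merging: 4 files of 256 MiB each.
merged_files = [256 * 1024 * 1024] * 4

print(estimate_mappers(small_files))
print(estimate_mappers(merged_files))
```

Under these assumptions the small-file layout needs one mapper per file, 
while the merged layout needs only a handful for roughly the same amount of 
data.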

In my Hadoop cluster, hive.merge.mapredfiles was set to false (the default 
value). After I changed it to true for Kylin's job, the number of files in the 
intermediate table dropped to 4, each up to 256 MB, which looks good. See the 
Hive configuration documentation at: 
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
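
One possible way to apply the setting only to Kylin's job, rather than 
cluster-wide, is to prepend session-level SET statements to the Hive 
statement that creates the flat table. The Python sketch below shows the idea 
only; it is not how Kylin builds its Hive commands. The helper name, the table 
names, and the sample statement are hypothetical. hive.merge.size.per.task is 
included on the assumption that it controls the merged file size and that 
256000000 is its usual default; please verify both against the documentation 
for your Hive version.

```python
# Session-level Hive settings to request merging of small output files.
MERGE_SETTINGS = {
    # From this issue: false by default, true enables merging after MapReduce jobs.
    "hive.merge.mapredfiles": "true",
    # Assumption: target size of merged files in bytes (check your Hive version).
    "hive.merge.size.per.task": "256000000",
}


def with_merge_settings(hql, settings=MERGE_SETTINGS):
    """Return hql prefixed with one 'SET key=value;' line per setting."""
    prefix = "".join(f"SET {key}={value};\n" for key, value in settings.items())
    return prefix + hql


# Hypothetical flat-table statement, for illustration only.
flat_table_hql = (
    "INSERT OVERWRITE TABLE kylin_intermediate_example "
    "SELECT * FROM example_fact_table;"
)
print(with_merge_settings(flat_table_hql))
```

Keeping the settings in the job's own script leaves the cluster default 
untouched for other Hive users.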

  was:
When I build a cube, I noticed that when build the dictionary and calculate the 
cube, there are a large number of mappers be started (more than 10,000); With 
the log I noticed many mappers has 0 or much less records to process, this 
confused me; 

Then I checked the storage location of the flat table, found there are many 
files; I did a count and found it is the same number as the mappers; 

Too many mappers will cause much overhead, and download the cluster's 
performance; Kylin should ask Hive to merge those small files during creating 
the flat table step. 


> When Kylin create the flat hive table, it generates large number of small 
> files in HDFS 
> ----------------------------------------------------------------------------------------
>
>                 Key: KYLIN-702
>                 URL: https://issues.apache.org/jira/browse/KYLIN-702
>             Project: Kylin
>          Issue Type: Improvement
>          Components: General
>    Affects Versions: v0.7.1
>            Reporter: Shaofeng SHI



