[ 
https://issues.apache.org/jira/browse/HCATALOG-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489648#comment-13489648
 ] 

Arup Malakar commented on HCATALOG-538:
---------------------------------------

For 100GB of TEXT data, I get around 840 map tasks. If the data is uniformly 
distributed, each map task ends up creating 300 part files, one for each of 
the 300 partitions it processed. That puts the total number of files created 
by the job on the order of 840 * 300 = 252,000. (I am seeing around 100K in 
my case, probably because not every map sees all 300 partitions in its 
input.)
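
The arithmetic above can be checked with a short sketch. The function and 
variable names are hypothetical and only illustrate the estimate; the inputs 
are the figures quoted in this comment (840 map tasks, 300 partitions, about 
100K files observed). The average it derives is a rough implication of those 
figures, not a measured value.

```python
def part_file_upper_bound(map_tasks: int, partitions: int) -> int:
    """Worst case: every map task writes one part file per partition."""
    return map_tasks * partitions


def avg_partitions_per_map(observed_files: int, map_tasks: int) -> float:
    """Average number of distinct partitions each map task actually wrote to."""
    return observed_files / map_tasks


# Figures taken from the comment; observed_files is the approximate ~100K count.
map_tasks = 840
partitions = 300
observed_files = 100_000

upper = part_file_upper_bound(map_tasks, partitions)
avg = avg_partitions_per_map(observed_files, map_tasks)

print(upper)       # 252000
print(round(avg))  # 119
```

Under these numbers, each map task writes to roughly 119 of the 300 
partitions on average, which would be consistent with the input not being 
perfectly uniform across maps.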

Moving this huge number of files from the temporary location to the final 
location takes time, and I am looking into how to optimize that. The number 
of files created is a concern too, as it is going to put a heavy load on the 
namenode. 
                
> HCatalogStorer fails for 100GB of data with dynamic partitioning (number of 
> partition is 300)
> ---------------------------------------------------------------------------------------------
>
>                 Key: HCATALOG-538
>                 URL: https://issues.apache.org/jira/browse/HCATALOG-538
>             Project: HCatalog
>          Issue Type: Bug
>    Affects Versions: 0.4, 0.5
>         Environment: Hadoop 0.23.4
> HCatalog 0.4
>            Reporter: Arup Malakar
>            Assignee: Arup Malakar
>
> A Hadoop job with 100GB of data and 300 partitions fails. All the maps 
> succeed, but the job commit fails after that. This looks like a timeout 
> issue, as commitJob() takes more than 10 minutes. I am running this on 
> hadoop-0.23.4 and am experimenting with 
> yarn.nm.liveness-monitor.expiry-interval-ms, 
> yarn.am.liveness-monitor.expiry-interval-ms, etc. to make it work.
> This JIRA is for optimizing commitJob(), as 10 minutes is too long.
> On a side note, storing 100GB of data without partitions takes ~12 minutes, 
> while the same amount of data with 300 partitions fails after 45 minutes. 
> These tests were run on a 10-node cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
