[ https://issues.apache.org/jira/browse/MAPREDUCE-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170516#comment-15170516 ]

zhihai xu commented on MAPREDUCE-6622:
--------------------------------------

This patch also fixes a memory leak caused by a race condition in 
{{CachedHistoryStorage.getFullJob}}. We can reproduce the leak by rapidly 
refreshing the JHS web page for a job with more than 40,000 mappers. The race: 
{{fileInfo.loadJob()}} takes a long time to load a job that large, and during 
that window {{fileInfo.loadJob()}} is called multiple times for the same job 
because there is no synchronization between {{loadedJobCache.get(jobId)}} and 
{{loadJob(fileInfo)}}. The used heap quickly climbs; in the heap dump we found 
56 {{CompletedJob}} instances for the same job ID, holding more than 2 million 
mappers in total (56 * 40,000).
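For illustration, a minimal sketch of this check-then-load race (simplified, 
with placeholder types and names; not the actual {{CachedHistoryStorage}} code):
{code}
// A simplified sketch of the check-then-load race (hypothetical names, not
// the real CachedHistoryStorage code).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class RacyJobCache {
  private final Map<String, String> loadedJobCache = new ConcurrentHashMap<>();

  String getFullJob(String jobId) {
    String job = loadedJobCache.get(jobId); // threads A and B can both miss here...
    if (job == null) {
      job = loadJob(jobId);                 // ...so both run the slow, large load
      loadedJobCache.put(jobId, job);       // both copies are live on the heap
    }                                       // until the losing one becomes garbage
    return job;
  }

  private static String loadJob(String jobId) {
    // Stand-in for fileInfo.loadJob(), which builds a CompletedJob holding
    // tens of thousands of task objects.
    return "job-" + jobId;
  }
}
{code}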
Per the Guava javadoc at 
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilder.html#build(com.google.common.cache.CacheLoader)
this won't be an issue for {{com.google.common.cache.LoadingCache}}:
{code}
If another thread is currently loading the value for this key, simply waits for 
that thread to finish and returns its loaded value
{code}
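For comparison, a minimal sketch of the same lookup going through a Guava 
{{LoadingCache}} (the Guava calls are real API; the job types and names here 
are placeholders):
{code}
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

class DedupedJobCache {
  private final LoadingCache<String, String> loadedJobCache =
      CacheBuilder.newBuilder()
          .maximumSize(5) // e.g. mapreduce.jobhistory.loadedjobs.cache.size
          .build(new CacheLoader<String, String>() {
            @Override
            public String load(String jobId) {
              return loadJob(jobId); // runs once per key; other callers wait
            }
          });

  String getFullJob(String jobId) {
    // Concurrent callers for the same jobId block on a single load instead
    // of each building their own CompletedJob-sized copy.
    return loadedJobCache.getUnchecked(jobId);
  }

  private static String loadJob(String jobId) {
    return "job-" + jobId; // stand-in for the expensive history file load
  }
}
{code}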
This looks like a critical issue to me. Should we backport this patch to the 
2.7.3 and 2.6.5 branches?


> Add capability to set JHS job cache to a task-based limit
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-6622
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6622
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobhistoryserver
>    Affects Versions: 2.7.2
>            Reporter: Ray Chiang
>            Assignee: Ray Chiang
>              Labels: supportability
>             Fix For: 2.9.0
>
>         Attachments: MAPREDUCE-6622.001.patch, MAPREDUCE-6622.002.patch, 
> MAPREDUCE-6622.003.patch, MAPREDUCE-6622.004.patch, MAPREDUCE-6622.005.patch, 
> MAPREDUCE-6622.006.patch, MAPREDUCE-6622.007.patch, MAPREDUCE-6622.008.patch, 
> MAPREDUCE-6622.009.patch, MAPREDUCE-6622.010.patch, MAPREDUCE-6622.011.patch, 
> MAPREDUCE-6622.012.patch, MAPREDUCE-6622.013.patch, MAPREDUCE-6622.014.patch
>
>
> The property mapreduce.jobhistory.loadedjobs.cache.size caps the cache by 
> job count, but the cached jobs can be of varying size.  This is generally 
> not a problem when job sizes are uniform or small, but when jobs are very 
> large (say, greater than 250k tasks), the JHS heap size can grow 
> tremendously.
> When multiple cached jobs are very large, the JHS can lock up and spend all 
> its time in GC.  However, since the cache is holding on to all the jobs, not 
> much heap space can be freed up.
> Since the amount of heap used is directly proportional to the total number 
> of tasks loaded, a property that caps the number of tasks allowed in the 
> cache should help prevent the JHS from locking up.
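For illustration, one way to express such a task-based cap is Guava's weigher 
support (the {{CacheBuilder}}/{{Weigher}} calls are real Guava API; the job 
type and task counts here are placeholders, not the patch's actual code):
{code}
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.Weigher;

class TaskWeightedJobCache {
  // Placeholder job record; the real cache holds CompletedJob instances.
  static class Job {
    final int totalTasks;
    Job(int totalTasks) { this.totalTasks = totalTasks; }
  }

  private final LoadingCache<String, Job> loadedJobCache =
      CacheBuilder.newBuilder()
          // Cap the cache by total tasks rather than by job count, since
          // heap use is directly proportional to the tasks loaded.
          .maximumWeight(500_000)
          .weigher(new Weigher<String, Job>() {
            @Override
            public int weigh(String jobId, Job job) {
              return job.totalTasks;
            }
          })
          .build(new CacheLoader<String, Job>() {
            @Override
            public Job load(String jobId) {
              return loadJob(jobId);
            }
          });

  private static Job loadJob(String jobId) {
    return new Job(40_000); // stand-in for the expensive history load
  }
}
{code}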



