[ https://issues.apache.org/jira/browse/HADOOP-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665484#action_12665484 ]

Devaraj Das commented on HADOOP-4766:
-------------------------------------

What worries me about the existing patch is that the number of jobs/tasks held 
in memory at any point is not at all predictable. In my experiments with this 
patch, and with a standalone program simulating the same behavior the patch is 
trying to achieve, I saw that even after purging all the jobs, the memory usage 
as reported by Runtime.totalMemory() - Runtime.freeMemory() didn't come down 
for quite a while, and the purge thread kept trying to free up memory 
needlessly (note that things like whether incremental GC is in use also 
influence this behavior).
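
To illustrate, here is a minimal sketch of a memory-threshold purge check built 
on those Runtime calls. This is not code from the attached patch; the class 
name and the configurable 'threshold' field are made up for illustration:

{code:java}
// Sketch of a memory-threshold based purge check (hypothetical names).
public class MemoryBasedPurgeCheck {
  private final float threshold; // e.g. 0.75, 0.8 or 0.9; hard to pick

  public MemoryBasedPurgeCheck(float threshold) {
    this.threshold = threshold;
  }

  /** Returns true if retained jobs/tasks should be purged. */
  public boolean shouldPurge() {
    Runtime rt = Runtime.getRuntime();
    long used = rt.totalMemory() - rt.freeMemory();
    // 'used' also counts garbage that has not been collected yet, so it can
    // stay high for a while after retained jobs are dropped, which is why the
    // purge thread ends up trying to free memory needlessly.
    return used > threshold * rt.maxMemory();
  }
}
{code}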
The approach of keeping at most 'n' completed tasks in memory at least leads to 
much more predictability. True, we don't know the exact memory consumed by a 
TIP, but we can make a good estimate there and tweak the max-tasks-in-memory 
value if need be. Also, in the current patch, the memory-usage-threshold 
configuration is equally dependent on estimation - I am not sure what the 
threshold should be: 0.75, 0.8, or 0.9?
Why do you say it is overkill? I thought basing things on estimating total 
memory usage is trickier. Basing it on the number of completed tasks is very 
similar to the "number of completed jobs" limit we currently have; we are just 
stepping one level lower and specifying a value for something whose base size 
will always remain under control. Also, completed jobs should be treated as one 
unit with respect to removal. For example, if the configured max is 1000 tasks 
and we have a job with 1100 tasks, the entire job should be removed (as opposed 
to removing only 1000 of its tasks), which keeps the whole thing really simple.
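
Roughly what I have in mind, as a sketch only (the class, field, and method 
names below are hypothetical, not from any attached patch):

{code:java}
import java.util.LinkedList;

// Sketch: retire completed jobs as whole units once the total number of
// retained completed tasks crosses a configured cap (all names hypothetical).
public class CompletedJobRetirer {

  static class RetiredJobInfo {
    final String jobId;
    final int numTasks; // completed TIPs retained for this job
    RetiredJobInfo(String jobId, int numTasks) {
      this.jobId = jobId;
      this.numTasks = numTasks;
    }
  }

  private final int maxRetainedTasks;                 // e.g. 1000
  private final LinkedList<RetiredJobInfo> retained = // oldest job first
      new LinkedList<RetiredJobInfo>();
  private int retainedTasks = 0;

  public CompletedJobRetirer(int maxRetainedTasks) {
    this.maxRetainedTasks = maxRetainedTasks;
  }

  public synchronized void jobCompleted(RetiredJobInfo job) {
    retained.addLast(job);
    retainedTasks += job.numTasks;
    // Remove the oldest jobs in their entirety until we are back under the
    // cap. A job with more tasks than the cap (say 1100 tasks against a cap
    // of 1000) is dropped as a whole rather than partially.
    while (retainedTasks > maxRetainedTasks && !retained.isEmpty()) {
      RetiredJobInfo oldest = retained.removeFirst();
      retainedTasks -= oldest.numTasks;
      // ... here the JobTracker would drop the oldest job's in-memory TIPs ...
    }
  }
}
{code}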
Again, this is a short-term fix until we move to the model of having a separate 
History server process.

> Hadoop performance degrades significantly as more and more jobs complete
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-4766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4766
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.2, 0.19.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Blocker
>         Attachments: HADOOP-4766-v1.patch, HADOOP-4766-v2.10.patch, 
> HADOOP-4766-v2.4.patch, HADOOP-4766-v2.6.patch, HADOOP-4766-v2.7-0.18.patch, 
> HADOOP-4766-v2.7-0.19.patch, HADOOP-4766-v2.7.patch, 
> HADOOP-4766-v2.8-0.18.patch, HADOOP-4766-v2.8-0.19.patch, 
> HADOOP-4766-v2.8.patch, map_scheduling_rate.txt
>
>
> When I ran the gridmix 2 benchmark load on a fresh cluster of 500 nodes with 
> Hadoop trunk, the load, consisting of 202 map/reduce jobs of various sizes, 
> completed in 32 minutes.
> Then I ran the same set of jobs on the same cluster; they completed in 43 
> minutes.
> When I ran them the third time, it took (almost) forever --- the job tracker 
> became non-responsive.
> The job tracker's heap size was set to 2GB.
> The cluster is configured to keep up to 500 jobs in memory.
> The job tracker kept one CPU busy all the time. It looks like this was due to GC.
> I believe releases 0.18 and 0.19 have similar behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
