[ https://issues.apache.org/jira/browse/HADOOP-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659091#action_12659091 ]

Amar Kamat commented on HADOOP-4766:
------------------------------------

bq. Note that the total memory used after running 9 sleep jobs (100,000 maps 
with 1 sec wait) back to back (a few were killed) was ~384 MB.
I realized that while performing the above experiment I was constantly 
analyzing job history, which loads the parsed job history into memory. 

Here are the results for the same experiment on 200 nodes without any 
interference.

||run ||JT memory used before job run ||job runtime||
|1    |9.74 MB                        |25.78 min|
|2    |71 MB                          |25.58 min|
|3    |4.88 MB                        |25.63 min|
|4    |6.14 MB                        |25.60 min|
|5    |4.92 MB                        |25.63 min|
|6    |10.32 MB                       |25.63 min|

Even after running a few large (100,000-map) jobs, the JobTracker's memory usage 
dropped as low as 3 MB and peaked at ~80 MB. Note that I triggered an explicit GC 
from the {{ExpireLaunchingTasks}} thread; a rough sketch of that placement follows. 
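
Purely as an illustration of that note, here is a minimal sketch of a periodic 
housekeeping thread that hints a GC after its cleanup pass, roughly where the 
{{ExpireLaunchingTasks}} call was placed. The class name, interval, and loop 
structure below are assumptions for illustration, not the actual JobTracker code:

{code:java}
// Hypothetical sketch, not the real ExpireLaunchingTasks implementation:
// a periodic housekeeping thread that expires stale launching tasks and
// then hints the JVM to collect garbage.
public class ExpireLaunchingTasksSketch implements Runnable {
  private static final long SLEEP_INTERVAL_MS = 60 * 1000; // assumed interval

  private volatile boolean running = true;

  @Override
  public void run() {
    while (running) {
      try {
        Thread.sleep(SLEEP_INTERVAL_MS);
        // ... expire tasks whose launch has timed out (omitted) ...
        System.gc(); // explicit GC hint after the cleanup pass
      } catch (InterruptedException ie) {
        running = false;
      }
    }
  }

  public void shutdown() {
    running = false;
  }
}
{code}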

Some points to note:
- I think the {{JobTracker}} should have a mechanism to drop completed jobs 
whenever it suspects it is running low on memory. There is no point in keeping 
100 jobs per user and slowing down or killing the JT. One way to do this would 
be to drop completed jobs whenever the JT's used memory crosses x% of the 
maximum available memory, with x defaulting to 75. Completed jobs could be 
evicted oldest first, by their finish time, and this cleanup should continue 
until the JT's memory usage falls below the limit (see the sketch after this list). 
- Also, a job should be accepted (expanded) only when there is sufficient 
memory, i.e. the used memory stays within the usable limit (x% of max_available_memory).
- Job history analysis caches some analysis results (see 
{{loadhistory.jsp}}). This could cause problems when large jobs are analyzed. I 
feel we should not cache job-history analysis results and should instead redo the 
analysis every time.
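
To make the first two points concrete, below is a rough, hypothetical sketch of 
such an eviction policy. The class, the method names, and the 75% default are 
illustrative assumptions, not existing {{JobTracker}} code:

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of the proposed policy: evict the oldest completed jobs
// once used heap crosses x% of max heap, and only accept (expand) a new job
// while usage stays under that limit.
public class CompletedJobEvictionSketch {
  // assumed default threshold from the comment above: 75% of max heap
  private static final double USABLE_FRACTION = 0.75;

  /** Minimal placeholder for a retained completed job. */
  public static class CompletedJob {
    final String jobId;
    final long finishTime; // job finish time in millis, used as the eviction age

    CompletedJob(String jobId, long finishTime) {
      this.jobId = jobId;
      this.finishTime = finishTime;
    }
  }

  // smallest finish time first, so the oldest completed job is evicted first
  private final PriorityQueue<CompletedJob> completedJobs =
      new PriorityQueue<>(Comparator.comparingLong((CompletedJob j) -> j.finishTime));

  private static boolean overLimit() {
    Runtime rt = Runtime.getRuntime();
    long used = rt.totalMemory() - rt.freeMemory();
    return used > USABLE_FRACTION * rt.maxMemory();
  }

  /** Drop the oldest completed jobs until usage falls below the limit. */
  public synchronized void maybeEvict() {
    while (overLimit() && !completedJobs.isEmpty()) {
      CompletedJob evicted = completedJobs.poll();
      // the real JT would drop the JobInProgress and its task data here
      System.out.println("Evicting completed job " + evicted.jobId);
    }
  }

  /** Accept (expand) a new job only while memory usage is within the limit. */
  public synchronized boolean canAcceptJob() {
    maybeEvict();
    return !overLimit();
  }

  public synchronized void jobCompleted(String jobId, long finishTime) {
    completedJobs.add(new CompletedJob(jobId, finishTime));
    maybeEvict();
  }
}
{code}

The key property of this approach is that eviction is driven by actual heap usage 
rather than a fixed per-user job count, so a burst of very large completed jobs 
cannot push the JT past the usable-memory limit.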

> Hadoop performance degrades significantly as more and more jobs complete
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-4766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4766
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.2, 0.19.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.18.3, 0.19.1, 0.20.0
>
>         Attachments: HADOOP-4766-v1.patch, map_scheduling_rate.txt
>
>
> When I ran the gridmix 2 benchmark load on a fresh cluster of 500 nodes with 
> hadoop trunk, the load, consisting of 202 map/reduce jobs of various sizes, 
> completed in 32 minutes. 
> Then I ran the same set of jobs on the same cluster; they completed in 43 
> minutes.
> When I ran them a third time, it took (almost) forever: the job tracker 
> became non-responsive.
> The job tracker's heap size was set to 2 GB. 
> The cluster is configured to keep up to 500 jobs in memory.
> The job tracker kept one CPU busy all the time. It looks like this was due to GC.
> I believe releases 0.18 and 0.19 have similar behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
