[ https://issues.apache.org/jira/browse/HADOOP-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659091#action_12659091 ]
Amar Kamat commented on HADOOP-4766:
------------------------------------

bq. Note that the total memory used after running 9 sleep jobs (100,000 maps with a 1-sec wait) back to back (a few were killed) was ~384MB.

I realized that while performing the above experiment I was constantly analyzing job history, which loads the parsed job's history into memory. Here are the results for the same experiment on 200 nodes without any interference:

||run no||memory before job run||job runtime||
|1|9.74 MB|25.78 min|
|2|71 MB|25.58 min|
|3|4.88 MB|25.63 min|
|4|6.14 MB|25.60 min|
|5|4.92 MB|25.63 min|
|6|10.32 MB|25.63 min|

Even after running a few large (100,000-map) jobs, the JobTracker's memory usage went as low as ~3MB; it peaked at ~80MB. Note that I triggered a GC in the {{ExpireLaunchingTasks}} thread.

Some points to note:
- I think the {{JobTracker}} should have a mechanism to drop completed jobs whenever it suspects it is running low on memory. There is no point in keeping 100 jobs per user and slowing down or killing the JT. One way to do this would be to drop completed jobs whenever the JT's used memory crosses x% of the maximum available memory, with x defaulting to 75. Completed jobs could be evicted based on their age (job finish time), and this cleanup should continue until the JT's memory usage goes below the limit.
- Also, a job should be accepted (expanded) only once there is sufficient memory, i.e. room within the usable memory (x% * max_available_memory).
- Job-history analysis caches some job analysis results (see {{loadhistory.jsp}}). This might cause problems when large jobs are analyzed. I feel we should not cache job-history analysis results and should instead redo the analysis every time.
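To make the first point concrete, here is a minimal sketch of the proposed eviction policy: keep completed jobs in a finish-time-ordered queue and drop the oldest ones whenever used heap crosses a configurable fraction of the maximum heap. This is not the actual Hadoop code; all class, method, and field names are hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of the proposed policy: evict the oldest completed jobs once the
// JobTracker's used heap crosses x% of the maximum heap (default 75%).
// Hypothetical names throughout -- not the real JobTracker implementation.
public class CompletedJobEvictor {

    // Fraction of max heap beyond which completed jobs are dropped.
    private final double threshold;

    // Completed jobs ordered by finish time, oldest first.
    private final PriorityQueue<CompletedJob> completedJobs =
        new PriorityQueue<>(Comparator.comparingLong((CompletedJob j) -> j.finishTime));

    public CompletedJobEvictor(double threshold) {
        this.threshold = threshold;
    }

    public void jobCompleted(CompletedJob job) {
        completedJobs.add(job);
        evictIfLowOnMemory();
    }

    // Evict oldest completed jobs until used heap is back under the limit.
    void evictIfLowOnMemory() {
        Runtime rt = Runtime.getRuntime();
        long limit = (long) (rt.maxMemory() * threshold);
        while (!completedJobs.isEmpty() && usedMemory(rt) > limit) {
            CompletedJob oldest = completedJobs.poll();
            // Release references so the GC can reclaim the job's state.
            oldest.release();
        }
    }

    public int retainedCount() {
        return completedJobs.size();
    }

    private static long usedMemory(Runtime rt) {
        return rt.totalMemory() - rt.freeMemory();
    }

    // Hypothetical stand-in for a finished job's retained in-memory state.
    public static class CompletedJob {
        final long finishTime;

        CompletedJob(long finishTime) {
            this.finishTime = finishTime;
        }

        void release() {
            // Drop task/counter references here so they become collectable.
        }
    }
}
```

The key design choice is that eviction is driven by measured heap usage ({{Runtime.totalMemory() - Runtime.freeMemory()}}) rather than a fixed per-user job count, so the JT sheds history only when it is actually under memory pressure.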
> Hadoop performance degrades significantly as more and more jobs complete
> ------------------------------------------------------------------------
>
>                 Key: HADOOP-4766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4766
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.2, 0.19.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Blocker
>             Fix For: 0.18.3, 0.19.1, 0.20.0
>
>         Attachments: HADOOP-4766-v1.patch, map_scheduling_rate.txt
>
>
> When I ran the gridmix 2 benchmark load on a fresh cluster of 500 nodes with hadoop trunk, the gridmix load, consisting of 202 map/reduce jobs of various sizes, completed in 32 minutes.
> Then I ran the same set of jobs on the same cluster; they completed in 43 minutes.
> When I ran them a third time, it took (almost) forever --- the job tracker became non-responsive.
> The job tracker's heap size was set to 2GB.
> The cluster is configured to keep up to 500 jobs in memory.
> The job tracker kept one CPU busy all the time. It looked like this was due to GC.
> I believe releases 0.18 and 0.19 have similar behavior.