[ https://issues.apache.org/jira/browse/MAPREDUCE-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Chiang updated MAPREDUCE-6222:
----------------------------------
    Attachment: JHS New Display Top.png
                JHS Original Display Top.png
                MAPREDUCE-6222.001.patch
I dug into this a bit. It looks like there are two issues with the JHS:
1) Reading in the .jhist file for jobs with many tasks can take a long time.
At 50k entries, loading a .jhist file takes ~20-25 seconds. At 500k entries,
reading the .jhist file takes over 5 minutes and hangs up the JHS. Other than
generating a partial (smaller) .jhist file or generating a compressed file, I
don't have a suggestion for handling this issue.
2) HsTasksBlock#render() runs quickly for 10k entries or fewer. Somewhere
past 10k entries, the for loop takes progressively longer, and my tests bogged
down at around 16-17k entries. I tried a few simple tricks with the
tasksTableData StringBuffer, but still hit problems at about the same point.
One suggested fix would be to limit the tasks displayed (i.e. paginate once
more). I've got a partial fix and I've attached some images showing how it
could end up looking. I'd appreciate any feedback before I go too much further
on the code.
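
To make the pagination idea in 2) concrete, here is a minimal Java sketch of
slicing a task list into fixed-size pages before any rows are rendered. It is
not taken from MAPREDUCE-6222.001.patch: the TaskPager class and its method
names are hypothetical, and how a page index would reach HsTasksBlock#render()
is left open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or of the
// attached patch.
final class TaskPager {
    private TaskPager() {}

    /** Number of pages needed to show totalItems at pageSize items per page. */
    static int pageCount(int totalItems, int pageSize) {
        if (totalItems < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalItems >= 0 and pageSize > 0 required");
        }
        // Ceiling division, done in long to avoid int overflow.
        return (int) (((long) totalItems + pageSize - 1) / pageSize);
    }

    /**
     * Returns a copy of the items on the given zero-based page, or an empty
     * list when the page lies past the end of the input.
     */
    static <T> List<T> page(List<T> items, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex >= 0 and pageSize > 0 required");
        }
        long from = (long) pageIndex * pageSize;
        if (from >= items.size()) {
            return Collections.emptyList();
        }
        int to = (int) Math.min(from + pageSize, (long) items.size());
        // Copy the slice so the result does not keep a view onto the full list.
        return new ArrayList<>(items.subList((int) from, to));
    }
}
```

With something like this, the render loop would only iterate over
TaskPager.page(allTasks, pageIndex, pageSize), so the work per request is
bounded by pageSize rather than by the total task count. This only addresses
the rendering cost in 2), not the .jhist load time in 1).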
> HistoryServer Hangs Processing Large Jobs
> -----------------------------------------
>
> Key: MAPREDUCE-6222
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6222
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Andrew Johnson
> Attachments: JHS New Display Top.png, JHS Original Display Top.png,
> MAPREDUCE-6222.001.patch, head.jhist, historyserver_jstack.txt
>
>
> I'm encountering an issue with the MapReduce HistoryServer processing the
> history files for large jobs. This has come up several times for jobs
> with around 60000 total tasks. When the HistoryServer loads the .jhist file
> from HDFS for a job of that size (usually around 500 MB), the
> HistoryServer's CPU usage spikes and the UI becomes unresponsive. After about
> 10 minutes I restarted the HistoryServer and it behaved normally again.
> The cluster is running CDH 5.3 (2.5.0-cdh5.3.0). I've attached the output of
> jstack from a time this was occurring.