[ 
https://issues.apache.org/jira/browse/HADOOP-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646284#action_12646284
 ] 

Hemanth Yamijala commented on HADOOP-4445:
------------------------------------------

There are a few more issues with this information display. I had an offline 
discussion with Vivek, and we came up with a few observations and ideas.

- The information is accessed by the UI update thread and the updateQSI method 
without proper synchronization. This should be addressed. The 
QueueSchedulingInfo object is a simple data object, and the fields of a 
given instance should be read together as one consistent set. Currently, access 
to this object is synchronized via the TaskSchedulingMgr object. Maybe then, 
instead of accessing the QSI fields directly, the UI should access them via the 
TaskSchedulingMgr.

- The capacity scheduler also updates only the reduce scheduler or the map 
scheduler in a given heartbeat. So in a scenario where reduce tasks are 
finishing along with map tasks, since we update the reduce scheduler in 
preference to the map scheduler, the information for the map tasks could be off 
by more than a heartbeat. However, in a steady state, this may not be that big 
an issue.
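
To illustrate the first point, here is a minimal sketch of reading the QSI 
fields through the manager rather than directly. SchedulingMgrSketch, 
QueueSnapshot, getSnapshot and the field names are hypothetical stand-ins, not 
the actual capacity scheduler classes; only the updateQSI name comes from the 
existing code, and its signature here is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable copy of one queue's counters, so that the UI
// always sees values that were written together.
final class QueueSnapshot {
    final int runningMaps;
    final int runningReduces;

    QueueSnapshot(int runningMaps, int runningReduces) {
        this.runningMaps = runningMaps;
        this.runningReduces = runningReduces;
    }
}

// Hypothetical stand-in for TaskSchedulingMgr; not the real Hadoop class.
class SchedulingMgrSketch {
    private final Map<String, QueueSnapshot> byQueue = new HashMap<>();

    // Scheduler side: both counters are replaced under the manager's lock.
    synchronized void updateQSI(String queue, int runningMaps, int runningReduces) {
        byQueue.put(queue, new QueueSnapshot(runningMaps, runningReduces));
    }

    // UI side: takes the same lock, so it never observes a half-written update.
    synchronized QueueSnapshot getSnapshot(String queue) {
        QueueSnapshot s = byQueue.get(queue);
        return (s != null) ? s : new QueueSnapshot(0, 0);
    }
}
```

The UI thread would call getSnapshot("default") and render the returned 
object, instead of touching the live fields.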

There are some options to address this:
- We could make it explicit that the information is not synchronized with the 
cluster summary (as mentioned by Vivek above, though the information should 
probably not be treated as off by only a heartbeat)
- We could ensure that the information of both the map and reduce schedulers 
is updated at least once every so often. For example, we could update it once 
every 3 heartbeats or so.
- We could also have an updater thread that runs periodically and updates the 
numbers every time it runs. We could use the same code for updates as the 
updateQSI method itself, though it could maintain a separate copy of the data, 
so as to not introduce synchronization constraints on the scheduler.
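
The second option could look roughly like the sketch below. It assumes that 
each heartbeat refreshes exactly one of the two schedulers through the normal 
path; StalenessTracker, SketchTaskType and onHeartbeat are hypothetical names 
introduced only for illustration.

```java
// Hypothetical task kinds; not the Hadoop types.
enum SketchTaskType { MAP, REDUCE }

// Counts heartbeats since each scheduler's info was last refreshed and
// reports when the neglected one has reached the allowed staleness.
class StalenessTracker {
    private final int limit;              // e.g. 3 heartbeats
    private final int[] age = new int[2]; // heartbeats since last refresh

    StalenessTracker(int limit) {
        this.limit = limit;
    }

    // Call once per heartbeat with the type the normal path refreshed.
    // Returns true if the other type should be force-refreshed now.
    boolean onHeartbeat(SketchTaskType refreshed) {
        SketchTaskType other = (refreshed == SketchTaskType.MAP)
            ? SketchTaskType.REDUCE : SketchTaskType.MAP;
        age[refreshed.ordinal()] = 0;
        age[other.ordinal()]++;
        if (age[other.ordinal()] >= limit) {
            age[other.ordinal()] = 0; // the caller refreshes it in this heartbeat
            return true;
        }
        return false;
    }
}
```

With a limit of 3 and only reduce updates arriving, the third heartbeat 
returns true, so the map information is never more than 3 heartbeats old.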

The advantage of the last two approaches is that we could deterministically 
say how far off the scheduling info would be, as compared to the cluster 
status. For example, if the updater thread runs once every 30 seconds, we could 
say the information would be off by at most about 30 seconds.
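
A rough sketch of the updater-thread idea, assuming the display data can be 
reduced to a single value produced by some supplier. PeriodicInfoUpdater and 
its methods are hypothetical; the sketch uses the standard 
java.util.concurrent scheduler and keeps a separate copy for the UI, so the UI 
never takes the scheduler's lock.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical periodic updater that owns its own copy of the display data.
class PeriodicInfoUpdater {
    private final Supplier<String> source; // stands in for the updateQSI logic
    private final AtomicReference<String> displayCopy = new AtomicReference<>("");
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    PeriodicInfoUpdater(Supplier<String> source) {
        this.source = source;
    }

    // Recomputes the numbers and publishes them atomically.
    void refreshOnce() {
        displayCopy.set(source.get());
    }

    // Refreshes immediately, then once per period (e.g. every 30 seconds).
    void start(long periodSeconds) {
        timer.scheduleAtFixedRate(this::refreshOnce, 0, periodSeconds, TimeUnit.SECONDS);
    }

    // UI side: a lock-free read of the last published copy.
    String current() {
        return displayCopy.get();
    }

    void stop() {
        timer.shutdownNow();
    }
}
```

Calling start(30) would bound the staleness of current() by roughly the 
30-second period plus the time one refresh takes.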

Since in any case it appears that the information cannot be completely in sync, 
maybe we should leave it simple for now, mark that the information is not 
synchronized with the cluster status, and see if in steady state the 
information is way off. If that happens, we could fix it using one of the 
methods I've stated above. Thoughts?

> Wrong number of running map/reduce tasks are displayed in queue information.
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-4445
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4445
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Hadoop r705159, Queue=default, GC=100% 
> MapCapacity=ReduceCapacity=212
>            Reporter: Karam Singh
>            Assignee: Sreekanth Ramakrishnan
>
> Wrong number of running map/reduce tasks are displayed in queue information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
