[
https://issues.apache.org/jira/browse/MAPREDUCE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291444#comment-13291444
]
Konstantin Shvachko commented on MAPREDUCE-4305:
------------------------------------------------
Task locality is important. It is interesting that the Capacity Scheduler only
needs to be hooked up to the logic that already exists in JobInProgress etc. I
went over the general logic of the patch and it looks good, but I have several
formatting and code organization comments.
# Append _PROPERTY to new config key constants, e.g.
NODE_LOCALITY_DELAY_PROPERTY. Looks like other constants in
CapacitySchedulerConf are like that.
# Wrap long lines.
# In CapacitySchedulerConf, convert the comments describing variables to JavaDoc.
# In initializeDefaults() you should use {{capacity-scheduler}} not
{{fairscheduler}} config variables. Also since you introduced constants for the
keys, use them rather than the raw keys.
# JobInfo is confusing because there is already a class with that name. Call it
something like JobLocality. I'd rather move it into JobQueuesManager, because
the latter maintains the map of those objects.
# Correct the indentation in CapacityTaskScheduler; in particular, eliminate all
tabs and use spaces only.
# Add spaces between arguments, operators, and in some LOG messages.
# Add empty lines between new methods.
# updateLocalityWaitTimes() and updateLastMapLocalityLevel() should belong to
JobQueuesManager, imo.
# JobQueuesManager.infos is a map keyed with JobInProgress. Would it be better
to use JobID as the key?
# In TaskSchedulingMgr you need only one version of obtainNewTask to be
abstract, the one with cachelevel parameter. The other one should not be
abstract and just call the abstract obtainNewTask() with cachelevel set to any.
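To illustrate the last point, here is a minimal sketch of the overload arrangement. It is not taken from the patch: the String parameters stand in for the real tracker and job types, ANY_CACHE_LEVEL is a hypothetical constant for "no locality constraint", and MapSchedulingMgr is a made-up subclass that only records the level it was called with.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: simplified stand-ins, not the real Hadoop signatures.
abstract class TaskSchedulingMgr {
  // Hypothetical sentinel meaning "any locality level is acceptable".
  static final int ANY_CACHE_LEVEL = Integer.MAX_VALUE;

  // The only abstract version: subclasses implement the cachelevel variant.
  abstract String obtainNewTask(String tracker, String job, int cachelevel);

  // Concrete convenience version: delegates with cachelevel set to "any".
  String obtainNewTask(String tracker, String job) {
    return obtainNewTask(tracker, job, ANY_CACHE_LEVEL);
  }
}

// Hypothetical subclass used only to show the delegation.
class MapSchedulingMgr extends TaskSchedulingMgr {
  final List<Integer> seenLevels = new ArrayList<>();

  @Override
  String obtainNewTask(String tracker, String job, int cachelevel) {
    seenLevels.add(cachelevel);
    return job + "@" + tracker;
  }
}
```

With this arrangement each subclass overrides a single method, and callers that do not care about locality keep using the two-argument form.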
> Implement delay scheduling in capacity scheduler for improving data locality
> ----------------------------------------------------------------------------
>
> Key: MAPREDUCE-4305
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4305
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Mayank Bansal
> Assignee: Mayank Bansal
> Attachments: MAPREDUCE-4305, MAPREDUCE-4305-1.patch
>
>
> With the Capacity Scheduler, only about 40%-50% of tasks are data-local, which
> is not good. In my tests on a 70-node cluster, I consistently get data locality
> of around 40-50% on a free cluster.
> I think we need to implement something like delay scheduling in the capacity
> scheduler for improving the data locality.
> http://radlab.cs.berkeley.edu/publication/308
> After implementing delay scheduling on Hadoop 22, I am getting 100% data
> locality on a free cluster and around 90% data locality on a busy cluster.
> Thanks,
> Mayank