[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291444#comment-13291444
 ] 

Konstantin Shvachko commented on MAPREDUCE-4305:
------------------------------------------------

Task locality is important. Interesting that it is only necessary to hook 
Capacity Scheduler up to the logic that already existed in JobInProgress etc. I 
went over the general logic of the patch. It looks good. But I have several 
formatting and code organization comments.
# Append _PROPERTY to new config key constants, e.g. 
NODE_LOCALITY_DELAY_PROPERTY. Looks like other constants in 
CapacitySchedulerConf are like that.
# Wrap long lines.
# In CapacitySchedulerConf convert comments describing variables to JavaDoc.
# In initializeDefaults() you should use {{capacity-scheduler}} not 
{{fairscheduler}} config variables. Also since you introduced constants for the 
keys, use them rather than the raw keys.
# JobInfo is a confusing name because there is already a class with that name. 
Call it something like JobLocality. I'd rather move it into JobQueuesManager, 
because the latter maintains the map of those objects.
# Correct the indentation in CapacityTaskScheduler; in particular, eliminate 
all tabs and use spaces only.
# Add spaces between arguments, operators, and in some LOG messages.
# Add empty lines between new methods.
# updateLocalityWaitTimes() and updateLastMapLocalityLevel() should belong to 
JobQueuesManager, imo.
# JobQueuesManager.infos is a map keyed with JobInProgress. Would it be better 
to use JobID as the key?
# In TaskSchedulingMgr you need only one version of obtainNewTask to be 
abstract, the one with cachelevel parameter. The other one should not be 
abstract and just call the abstract obtainNewTask() with cachelevel set to any.
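
To illustrate the last point, here is a minimal sketch of the overload 
structure. It is not the patch code: the class names ending in Sketch, the 
Task stand-in, the String parameters and the ANY_CACHE_LEVEL sentinel are all 
hypothetical and only show the delegation pattern.

```java
// Hypothetical stand-in for the real task type, which carries far more state.
class Task {
  final String description;

  Task(String description) {
    this.description = description;
  }
}

abstract class TaskSchedulingMgrSketch {
  // Assumed sentinel meaning "any locality level is acceptable".
  static final int ANY_CACHE_LEVEL = Integer.MAX_VALUE;

  // The only abstract variant: subclasses implement the locality-aware lookup.
  abstract Task obtainNewTask(String tracker, String job, int cacheLevel);

  // Concrete overload: delegates to the abstract one with cacheLevel set to any.
  Task obtainNewTask(String tracker, String job) {
    return obtainNewTask(tracker, job, ANY_CACHE_LEVEL);
  }
}

// Illustrative subclass; it only records the arguments it was called with.
class MapSchedulingMgrSketch extends TaskSchedulingMgrSketch {
  @Override
  Task obtainNewTask(String tracker, String job, int cacheLevel) {
    return new Task(job + "@" + tracker + " level=" + cacheLevel);
  }
}
```

With this shape each subclass overrides a single method, and callers that do 
not care about locality keep using the two-argument form.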

                
> Implement delay scheduling in capacity scheduler for improving data locality
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4305
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4305
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>         Attachments: MAPREDUCE-4305, MAPREDUCE-4305-1.patch
>
>
> Only about 40-50% of Capacity Scheduler tasks are data-local, which is not good.
> In my tests on a 70-node cluster I consistently get data locality of around 
> 40-50% on a free cluster.
> I think we need to implement something like delay scheduling in the capacity 
> scheduler for improving the data locality.
> http://radlab.cs.berkeley.edu/publication/308
> After implementing delay scheduling on Hadoop 22 I am getting 100% data 
> locality on a free cluster and around 90% data locality on a busy cluster.
> Thanks,
> Mayank

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
