[ 
https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736031#action_12736031
 ] 

Hong Tang commented on MAPREDUCE-801:
-------------------------------------

@vinod, PIG has its own input handling system (Slicer ~= InputFormat, Slice ~= 
Input Split). When PIG uses MapReduce as the backend, the default Slicer 
(PigSlicer) creates one slice for each DFS block. However, a bug in the code 
makes it return the aggregation of all hosts for all blocks of the file 
(ignoring the offset and length of the slice) instead of the hosts for that 
particular block. Looking at the patch attached to PIG-878 is probably the 
easiest way to understand the problem.
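
To make the difference concrete, here is a minimal, self-contained sketch of 
the two behaviours. It is not the PigSlicer code and not the PIG-878 patch; 
the Block class and both method names are hypothetical stand-ins, and the 
block list is assumed to have been fetched from the file system already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SliceLocations {
    /** Hypothetical stand-in for one DFS block: byte range plus replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Buggy behaviour: ignores the slice range and returns every host of the file. */
    static List<String> allHostsOfFile(List<Block> blocks) {
        Set<String> hosts = new LinkedHashSet<>();
        for (Block b : blocks) {
            hosts.addAll(b.hosts);
        }
        return new ArrayList<>(hosts);
    }

    /** Intended behaviour: only hosts of blocks overlapping [start, start + length). */
    static List<String> hostsForSlice(List<Block> blocks, long start, long length) {
        Set<String> hosts = new LinkedHashSet<>();
        long end = start + length;
        for (Block b : blocks) {
            long blockEnd = b.offset + b.length;
            if (b.offset < end && start < blockEnd) {
                hosts.addAll(b.hosts);
            }
        }
        return new ArrayList<>(hosts);
    }

    public static void main(String[] args) {
        // A two-block file with three replicas per block (made-up host names).
        List<Block> blocks = Arrays.asList(
            new Block(0, 64, Arrays.asList("h1", "h2", "h3")),
            new Block(64, 64, Arrays.asList("h4", "h5", "h6")));
        System.out.println(allHostsOfFile(blocks));        // six hosts
        System.out.println(hostsForSlice(blocks, 64, 64)); // three hosts
    }
}
```

With many blocks per file, the buggy variant makes almost every node look 
"local" for every slice, which is the inflated locality described in the 
issue.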

I can imagine similar problems happening to non-expert users who write their 
own input formats. You may argue that (1) only affects the user (directly); 
however, we share the same cluster among many users, and poor locality could 
thrash the whole cluster and thus affect all users' jobs (indirectly). The 
proposal does not really solve the problem; it merely makes sure the problem 
does not go unnoticed.
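
As a rough illustration of how small such a check could be, here is a sketch 
under assumptions of my own: the threshold of 10, the class name, and the 
method names are all hypothetical, and it uses java.util.logging rather than 
whatever the framework would actually use.

```java
import java.util.logging.Logger;

public class SplitLocationCheck {
    private static final Logger LOG =
        Logger.getLogger(SplitLocationCheck.class.getName());

    /** Hypothetical threshold; a real one would presumably be configurable. */
    static final int MAX_LOCATIONS = 10;

    /** True when a split reports more locations than the given limit. */
    static boolean hasTooManyLocations(String[] locations, int max) {
        return locations != null && locations.length > max;
    }

    /** Logs a warning instead of failing, so the job still runs. */
    static void warnIfTooMany(String splitName, String[] locations) {
        if (hasTooManyLocations(locations, MAX_LOCATIONS)) {
            LOG.warning("Split " + splitName + " reports " + locations.length
                + " locations (limit " + MAX_LOCATIONS
                + "); the input format may be computing locations incorrectly");
        }
    }

    public static void main(String[] args) {
        warnIfTooMany("part-00000:0+64", new String[30]); // logs a warning
        warnIfTooMany("part-00000:64+64", new String[3]); // silent
    }
}
```

The check only surfaces the problem in the job logs; it does not change which 
locations the scheduler sees.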

For (2), yes, we may choose to use a fraction of the locations, but do we need 
to worry that the scheduler may try to schedule tasks on that subset of hosts 
and thus make the actual job run much slower (than not specifying locations 
at all)?

> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-801
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>
> A customized input format may be buggy and report misleading locations 
> through its input splits; an example is PIG-878. When an input split returns 
> too many locations, it not only artificially inflates the percentage of 
> data-local or rack-local maps, but also forces the scheduler to use more 
> memory and work harder to conduct task assignment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
