[ https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736031#action_12736031 ]
Hong Tang commented on MAPREDUCE-801:
-------------------------------------
@vinod, PIG has its own input handling system (Slicer ~= InputFormat, Slice ~=
InputSplit). When PIG uses MapReduce as the backend, the default Slicer
(PigSlicer) creates a slice for each DFS block. However, a bug in the code
causes each slice to report the aggregate of all hosts for all blocks of the
file (ignoring the offset and length of the slice) instead of the hosts for
its own block. Looking at the patch attached to PIG-878 is probably the
quickest way to understand the problem.
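The difference between the buggy and the intended behavior can be sketched as
follows. This is a simplified Python illustration, not the PigSlicer code or
the PIG-878 patch; the Block type, the function names, and the host names are
all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for one DFS block of a file."""
    offset: int   # byte offset of the block within the file
    length: int   # block length in bytes
    hosts: tuple  # hosts holding a replica of this block


def hosts_for_file(blocks):
    """Buggy behavior: every host of every block, ignoring the slice range."""
    seen = []
    for block in blocks:
        for host in block.hosts:
            if host not in seen:
                seen.append(host)
    return seen


def hosts_for_slice(blocks, offset, length):
    """Intended behavior: hosts of blocks overlapping [offset, offset + length)."""
    end = offset + length
    seen = []
    for block in blocks:
        overlaps = block.offset < end and offset < block.offset + block.length
        if overlaps:
            for host in block.hosts:
                if host not in seen:
                    seen.append(host)
    return seen


# A three-block file with three replicas per block (made-up host names).
blocks = [
    Block(0, 128, ("h1", "h2", "h3")),
    Block(128, 128, ("h4", "h5", "h6")),
    Block(256, 128, ("h7", "h8", "h9")),
]

buggy = hosts_for_file(blocks)              # nine hosts for every slice
correct = hosts_for_slice(blocks, 128, 128)  # only the second block's hosts
```

With the buggy version, the number of reported locations grows with the number
of blocks in the file, so a large file yields a long host list for each slice.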
I can imagine similar problems happening to non-expert users who write their
own input formats. You may argue that (1) only affects the user directly.
However, we share the same cluster among many users, and poor locality could
thrash the whole cluster and thus indirectly affect all users' jobs. The
proposal does not really solve the problem; it merely makes sure the problem
does not go unnoticed.
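A minimal sketch of such a check, written in Python for brevity rather than
against the Hadoop API. The threshold value, the logger name, and the function
name are assumptions made for illustration; the issue does not specify them.

```python
import logging

# Hypothetical threshold; the issue does not say what "too many" means.
MAX_SPLIT_LOCATIONS = 10

logger = logging.getLogger("split_check")


def check_split_locations(split_id, locations, max_locations=MAX_SPLIT_LOCATIONS):
    """Return True if the split is within the limit; otherwise warn and return False."""
    if len(locations) > max_locations:
        logger.warning(
            "Split %s reports %d locations (limit %d); "
            "the input format may be reporting misleading locations",
            split_id, len(locations), max_locations,
        )
        return False
    return True


ok = check_split_locations("split-0", ["h1", "h2", "h3"])
suspicious = check_split_locations("split-1", ["h%d" % i for i in range(50)])
```

The check only surfaces the problem; the split itself is left untouched.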
For (2), yes, we may choose to use only a fraction of the locations. But should
we worry that the scheduler may then try to schedule tasks on that subset of
hosts, making the job actually run much slower than if no locations were
specified at all?
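The concern can be illustrated with a small Python sketch. It assumes the
fraction is chosen by keeping the first few reported hosts; that policy, the
function names, and the host names are hypothetical, since the discussion does
not say how the subset would be picked.

```python
def truncate_locations(locations, max_locations):
    """Keep only the first max_locations reported hosts (assumed policy)."""
    return list(locations)[:max_locations]


def truly_local(kept, actual_hosts):
    """Hosts in the kept subset that really hold the split's data."""
    return [host for host in kept if host in actual_hosts]


# A buggy split reports 30 hosts, but its block lives on only three of them.
reported = ["h%d" % i for i in range(30)]
actual = {"h27", "h28", "h29"}

kept = truncate_locations(reported, 10)  # h0 .. h9
local = truly_local(kept, actual)        # none of the kept hosts has the data
```

In this made-up case the scheduler would prefer ten hosts that hold none of the
split's data, which is the situation the question above worries about.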
> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>
> Key: MAPREDUCE-801
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Hong Tang
>
> A customized input format may be buggy and report misleading locations
> through its input splits; PIG-878 is an example. When an input split returns
> too many locations, it not only artificially inflates the percentage of
> data-local or rack-local maps, but also forces the scheduler to use more
> memory and work harder to conduct task assignment.