[
https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738196#action_12738196
]
Devaraj Das commented on MAPREDUCE-801:
---------------------------------------
I like the idea of truncating the number of locations to some fixed number like
5, and ignoring the others. It's a simple fix in the framework to limit the
number of locations to read per split. If the split generation code is buggy
w.r.t. generating the locations for the splits, then we can't do much anyway.
The location information is only used for creating the cache in the JobTracker
for doing optimal task assignments.
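To make the truncation idea concrete, here is a minimal sketch. The class name, the method name and the constant are hypothetical and are not part of the existing framework code; the cap of 5 is just the example number mentioned above.

```java
import java.util.Arrays;

class LocationTruncation {
    // Hypothetical cap, taken from the "some fixed number like 5" suggestion.
    static final int MAX_LOCATIONS = 5;

    // Returns at most MAX_LOCATIONS entries and ignores the rest.
    // A null array is treated as "no locations".
    static String[] truncateLocations(String[] locations) {
        if (locations == null) {
            return new String[0];
        }
        if (locations.length <= MAX_LOCATIONS) {
            return locations;
        }
        return Arrays.copyOf(locations, MAX_LOCATIONS);
    }

    public static void main(String[] args) {
        String[] hosts = {"h1", "h2", "h3", "h4", "h5", "h6", "h7"};
        System.out.println(Arrays.toString(truncateLocations(hosts)));
    }
}
```

Keeping the first few entries is the simplest policy; whether the first entries are the most useful ones depends on how the input format orders them.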
The other thing is the split bytes (the raw bytes corresponding to the
serialized split object). If the split data is too large and there are many
splits, then the JT again becomes vulnerable. The JT reads the split bytes and
stores them in memory, so that they can be sent as part of the task object to the
tasktracker chosen to run the task. There are multiple approaches to solve the
problem:
1) Limit the size of the split file
2) Back the splits on disk. The idea here is to create an index file while the
JobTracker is reading the split file. The splits are read one by one and their
offsets in the file are stored in the index file. The split data is discarded;
the location information is retained (after truncating maybe) and the location
info is used to create the cache as is already done. The index file is kept in
memory. When a map task is to be handed out to a TT, the JobTracker reads the
split data by looking up the index and seeking into the split file (similar to
the way we handle map outputs during shuffle).
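A rough sketch of approach 2 follows. It assumes a simplified, hypothetical split file layout in which each record is a 4-byte length followed by that many bytes of split data; the real split file format is different and also carries the locations. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class SplitIndexSketch {
    // One index entry: where a split's bytes start and how long they are.
    static final class Entry {
        final long offset;
        final int length;

        Entry(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Scans the file once, records offset and length of every split,
    // and skips over the split data instead of keeping it in memory.
    static List<Entry> buildIndex(String path) throws IOException {
        List<Entry> index = new ArrayList<>();
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            long fileLength = in.length();
            while (in.getFilePointer() < fileLength) {
                int length = in.readInt();
                if (length < 0) {
                    throw new IOException("Corrupt split record, length " + length);
                }
                long dataStart = in.getFilePointer();
                index.add(new Entry(dataStart, length));
                in.seek(dataStart + length);
            }
        }
        return index;
    }

    // Reads one split's bytes on demand by seeking to its recorded offset.
    static byte[] readSplit(String path, Entry entry) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(entry.offset);
            byte[] data = new byte[entry.length];
            in.readFully(data);
            return data;
        }
    }
}
```

With something like this, the memory held per split is one offset and one length (plus the retained locations), and the split bytes are read only when the task is handed out.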
We could have a cap on the max split size per split (instead of a cap on the
total split size) so that we don't use up too much RPC bandwidth while
transferring the split data to the tasktracker. The alternative here would be
to have the JT just pass the index information to the TT, and have the TT read
the split data directly from HDFS while localizing the task before the
launch.
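The per-split cap could be checked while the splits are being read, roughly as below. The limit value, the class name and the method name are placeholders and not existing framework code or configuration.

```java
import java.io.IOException;

class SplitSizeCap {
    // Hypothetical per-split limit in bytes; the actual value would need discussion.
    static final long MAX_SPLIT_BYTES = 64 * 1024;

    // Rejects a single split whose serialized size exceeds the per-split limit.
    static void checkSplitSize(int splitIndex, long splitLength) throws IOException {
        if (splitLength > MAX_SPLIT_BYTES) {
            throw new IOException("Split " + splitIndex + " is " + splitLength
                + " bytes; the per-split limit is " + MAX_SPLIT_BYTES + " bytes");
        }
    }
}
```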
Thoughts?
> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>
> Key: MAPREDUCE-801
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Hong Tang
>
> Customized input-format may be buggy and report misleading locations through
> input-split, an example of which is PIG-878. When an input split returns too
> many locations, it would not only artificially inflate the percentage of data
> local or rack local maps, but also force scheduler to use more memory and
> work harder to conduct task assignment.