[
https://issues.apache.org/jira/browse/MAPREDUCE-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737301#action_12737301
]
Doug Cutting commented on MAPREDUCE-801:
----------------------------------------
> The #locations per split to keep should probably be a cluster-wide config
> limit?
Sounds reasonable.
> Should we pick first n locations or pick randomly?
That depends on whether locations are ordered. For example, one might list
locations which have 90% of the data in a split ahead of locations that only
have 20%. (Think map-side-join, where a split might contain segments of
multiple files.) If that scenario sounds plausible, then we should pick the
first N, no?
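A minimal sketch of the "pick the first N" option, assuming locations are listed best-first as in the scenario above. The class and method names (SplitLocationLimiter, firstN) and the maxLocations parameter are hypothetical, made up for illustration; nothing here is an existing Hadoop API.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the Hadoop API.
final class SplitLocationLimiter {
    private SplitLocationLimiter() {}

    /**
     * Keeps at most maxLocations entries from the front of the array.
     * Assumes the InputFormat listed the hosts holding the most data first,
     * so dropping the tail loses the least useful locations.
     */
    static String[] firstN(String[] locations, int maxLocations) {
        if (maxLocations < 0) {
            throw new IllegalArgumentException("maxLocations must be >= 0");
        }
        if (locations.length <= maxLocations) {
            return locations.clone(); // already within the limit
        }
        return Arrays.copyOf(locations, maxLocations); // preserves original order
    }
}
```

If locations turn out not to be ordered, random selection would be just as good, and this helper would need a different policy.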
> We should do truncation on both the JobClient and JobTracker to be wary of
> DOS if a malicious client submits too many locations per split...
This all still feels like overkill to me. It reminds me of TSA policies
about shoes and liquids. There are not that many InputFormat implementations.
We should seek to make it easy to debug them generally rather than guard
against a particular bug seen once. To prevent DOS, we could put an overall
limit on the number of locations per job, or even the size of the splits file,
so that the JT doesn't run out of memory trying to process a job. We should
make it easier to notice when job locality is poor. But there are so many ways
folks can write poorly-performing applications and frameworks that spending a
lot of time guarding against this particular one seems a poor investment.
Also, truncation does nothing to protect against, e.g., an application that
simply lists the wrong locations. Truncation would not help locality in the
case of the PIG bug, since those locations were mostly wrong. The only thing
truncation does is protect against a job using too many resources in the JT,
and there are simpler ways to protect against that.
So, sure, if we're checking it once, why not twice!
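For the simpler protection suggested above (an overall limit on the number of locations per job), a rough sketch might look like the following. JobLocationBudget, its methods, and the maxLocationsPerJob parameter are hypothetical names chosen for illustration; where such a check would live in the JobTracker is left open.

```java
// Hypothetical helper for illustration; not part of the Hadoop API.
final class JobLocationBudget {
    private JobLocationBudget() {}

    /** Counts location entries across all splits of a job; a null entry counts as zero. */
    static long totalLocations(String[][] locationsPerSplit) {
        long total = 0;
        for (String[] locations : locationsPerSplit) {
            if (locations != null) {
                total += locations.length;
            }
        }
        return total;
    }

    /** Returns true if the job stays within the assumed per-job limit. */
    static boolean withinBudget(String[][] locationsPerSplit, long maxLocationsPerJob) {
        return totalLocations(locationsPerSplit) <= maxLocationsPerJob;
    }
}
```

A job that fails the check could be rejected or just logged with a warning; either way the cost is one pass over the splits, with no per-split truncation.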
> MAPREDUCE framework should issue warning with too many locations for a split
> ----------------------------------------------------------------------------
>
> Key: MAPREDUCE-801
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-801
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Hong Tang
>
> A customized input format may be buggy and report misleading locations
> through its input splits, an example of which is PIG-878. When an input split
> returns too many locations, it would not only artificially inflate the
> percentage of data-local or rack-local maps, but also force the scheduler to
> use more memory and work harder to conduct task assignment.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.