[
https://issues.apache.org/jira/browse/MAPREDUCE-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805452#comment-13805452
]
Jason Lowe commented on MAPREDUCE-5186:
---------------------------------------
bq. I wasn't sure whether the concern that led to the max split locations is no
longer valid (at least in hadoop 2).
AFAIK the limit was primarily intended to protect the JobTracker from being
excessively bogged down by jobs with huge numbers of split locations. In
Hadoop 2 the concern is lessened somewhat since part of the JT responsibilities
are in the job-specific AM. However, the RM could still be impacted since it
will see a large request from the AM as it tries to obtain locality for the
split.
bq. I'm not 100% familiar with this part of the code, but by truncating aren't
we dropping some of these locations on the floor?
No, I don't believe so. If you look at JobSplitWriter it is writing out the
split details separately from the locations. The locations are used solely for
determining locality for map tasks. From a map task perspective, the locations
aren't very interesting since it just needs to be able to open the file(s). It
doesn't need the locations to do that, and if some split did need location
details they'd be written out as part of serializing the split itself. So
bottom line for truncating the locations is that data locality might not be as
ideal as it could be for the task processing the split, but the split will
still be processed properly regardless.
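To illustrate the truncation idea, here is a minimal sketch in plain Java. It is not the JobSplitWriter code or either attached patch; the class name, the truncateLocations helper, and the host names are hypothetical, and it assumes truncation simply keeps the first N hosts.

```java
import java.util.Arrays;

public class SplitLocationSketch {
    // Hypothetical helper (not Hadoop's API): keep at most maxLocations hosts.
    // Only the locality hints are shortened; the split's own data is untouched.
    public static String[] truncateLocations(String[] locations, int maxLocations) {
        if (locations.length <= maxLocations) {
            return locations;
        }
        // Assumption: keep the first maxLocations entries and drop the rest.
        return Arrays.copyOf(locations, maxLocations);
    }

    public static void main(String[] args) {
        String[] hosts = {"host1", "host2", "host3", "host4"};
        String[] kept = truncateLocations(hosts, 2);
        System.out.println(Arrays.toString(kept)); // prints [host1, host2]
    }
}
```

With a very large maximum the array is returned unchanged, which matches the idea of effectively turning truncation off through the config.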
Given that a user/admin can effectively disable the location truncation by
setting the config to a very large value if desired, I think we should preserve the MR1
behavior to lighten the load on the AM/RM for jobs with lots of split
locations. I think the first patch is very close, although the unit test
change in that patch isn't appropriate. It should really be more of a unit
test (i.e.: no minicluster) that invokes the JobSplitWriter to write out splits
with more locations than the configured maximum and verifies that the locations
were truncated in the job split file written. There should be unit tests for
both old-style splits and new-style splits to verify the truncation is
occurring in both places.
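The rough shape of such a check, sketched without any Hadoop dependencies: the writer below is a hypothetical stand-in for JobSplitWriter, and its byte layout (a count followed by host names) is an assumption for illustration only, not the real job split file format. A real test would call JobSplitWriter for both split styles and read back what it wrote.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SplitMetaWriteSketch {
    // Hypothetical stand-in writer: caps the locations, then writes count + hosts.
    public static byte[] writeLocations(String[] locations, int maxLocations) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            int count = Math.min(locations.length, maxLocations);
            out.writeInt(count);
            for (int i = 0; i < count; i++) {
                out.writeUTF(locations[i]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the locations back, as a test would do with the written file.
    public static String[] readLocations(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] result = new String[in.readInt()];
            for (int i = 0; i < result.length; i++) {
                result[i] = in.readUTF();
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] hosts = {"host1", "host2", "host3", "host4", "host5"};
        String[] readBack = readLocations(writeLocations(hosts, 3));
        // The check: more locations went in than the maximum, only the maximum came out.
        if (readBack.length != 3) {
            throw new AssertionError("locations were not truncated");
        }
        System.out.println("locations written: " + readBack.length);
    }
}
```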
> mapreduce.job.max.split.locations causes some splits created by
> CombineFileInputFormat to fail
> ----------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-5186
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5186
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: job submission
> Affects Versions: 2.0.4-alpha, 2.2.0
> Reporter: Sangjin Lee
> Assignee: Robert Parker
> Priority: Critical
> Attachments: MAPREDUCE-5186v1.patch, MAPREDUCE-5186v2.patch
>
>
> CombineFileInputFormat can easily create splits that can come from many
> different locations (during the last pass of creating "global" splits).
> However, we observe that this often runs afoul of the
> mapreduce.job.max.split.locations check that's done by JobSplitWriter.
> The default value for mapreduce.job.max.split.locations is 10, and with any
> decent size cluster, CombineFileInputFormat creates splits that are well
> above this limit.