[jira] [Commented] (MAPREDUCE-5186) mapreduce.job.max.split.locations causes some splits created by CombineFileInputFormat to fail

Jason Lowe (JIRA) Fri, 25 Oct 2013 09:56:07 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13805452#comment-13805452
 ]


Jason Lowe commented on MAPREDUCE-5186:
---------------------------------------

bq. I wasn't sure whether the concern that led to the max split locations is no 
longer valid (at least in hadoop 2).

AFAIK the limit was primarily intended to protect the JobTracker from being 
excessively bogged down by jobs with huge numbers of split locations.  In 
Hadoop 2 the concern is lessened somewhat since part of the JT responsibilities 
are in the job-specific AM.  However the RM could still be impacted since it 
will see a large request from the AM as it tries to obtain locality for the 
split.

bq. I'm not 100% familiar with this part of the code, but by truncating aren't 
we dropping some of these locations on the floor? 

No, I don't believe so.  If you look at JobSplitWriter it is writing out the 
split details separately from the locations.  The locations are used solely for 
determining locality for map tasks.  From a map task perspective, the locations 
aren't very interesting since it just needs to be able to open the file(s).  It 
doesn't need the locations to do that, and if some split did need location 
details they'd be written out as part of serializing the split itself.  So 
bottom line for truncating the locations is that data locality might not be as 
ideal as it could be for the task processing the split, but the split will 
still be processed properly regardless.
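
To illustrate the point, here is a minimal sketch of what truncation amounts 
to.  This is not the actual JobSplitWriter code; the class and method names 
below are hypothetical and only the idea (cap the list of host names, leave 
the split itself alone) comes from the discussion above.

```java
import java.util.Arrays;

public class SplitLocationTruncation {
    // Hypothetical helper (an assumption for illustration, not a Hadoop API):
    // keeps at most maxLocations host names. The split payload is not
    // involved here; only the locality hints are shortened.
    public static String[] truncateLocations(String[] locations, int maxLocations) {
        if (locations.length <= maxLocations) {
            return locations; // already within the limit, nothing to drop
        }
        return Arrays.copyOf(locations, maxLocations);
    }

    public static void main(String[] args) {
        // A combined split that spans 25 hosts, as CombineFileInputFormat
        // can easily produce on a decent size cluster.
        String[] hosts = new String[25];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = "host" + i;
        }
        String[] kept = truncateLocations(hosts, 10);
        System.out.println(kept.length + " of " + hosts.length + " locations kept");
    }
}
```

The map task that eventually runs this split would still open the same 
file(s); only the scheduling hint is shorter.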

Given that a user/admin can effectively disable the location truncation by 
setting the config to a very large value if desired, I think we should 
preserve the MR1 behavior to lighten the load on the AM/RM for jobs with lots 
of split locations.  I think the first patch is very close, although the unit test 
locations.  I think the first patch is very close, although the unit test 
change in that patch isn't appropriate.  It should really be a true unit test 
(i.e., no minicluster) that invokes the JobSplitWriter to write out splits 
with more locations than the configured maximum and then verifies that the 
locations were truncated in the job split file that was written.  There should 
be unit tests for both old-style and new-style splits to verify the 
truncation occurs in both places.

> mapreduce.job.max.split.locations causes some splits created by 
> CombineFileInputFormat to fail
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: job submission
>    Affects Versions: 2.0.4-alpha, 2.2.0
>            Reporter: Sangjin Lee
>            Assignee: Robert Parker
>            Priority: Critical
>         Attachments: MAPREDUCE-5186v1.patch, MAPREDUCE-5186v2.patch
>
>
> CombineFileInputFormat can easily create splits that can come from many 
> different locations (during the last pass of creating "global" splits). 
> However, we observe that this often runs afoul of the 
> mapreduce.job.max.split.locations check that's done by JobSplitWriter.
> The default value for mapreduce.job.max.split.locations is 10, and with any 
> decent size cluster, CombineFileInputFormat creates splits that are well 
> above this limit.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
