[
https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074194#comment-14074194
]
Hadoop QA commented on MAPREDUCE-6003:
--------------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12657796/MAPREDUCE-6003.patch
against trunk revision .
{color:red}-1 patch{color}. The patch command could not apply the patch.
Console output:
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4768//console
This message is automatically generated.
> Resource Estimator suggests huge map output in some cases
> ---------------------------------------------------------
>
> Key: MAPREDUCE-6003
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 1.2.1
> Reporter: Chengbing Liu
> Assignee: Chengbing Liu
> Attachments: MAPREDUCE-6003.patch
>
>
> In some cases, ResourceEstimator can return a map output estimate that is
> far too large. This happens when the input size is not calculated correctly.
> A typical case is joining two Hive tables (one in HDFS and the other in
> HBase). The maps that process the HBase table finish first, and they report
> an input length of 0 because of TableInputFormat. Then, for a map that
> processes the HDFS table, the estimated output size is very large because of
> the wrong input size, so the map task cannot be assigned.
> There are two possible solutions to this problem:
> (1) Make input size correct for each case, e.g. HBase, etc.
> (2) Use another algorithm to estimate the map output, or at least make it
> closer to reality.
> I prefer the second approach, since the first would require every input type
> to be handled individually, which is not easy for some inputs such as URIs.
> In my opinion, we could make a second estimation which is independent of the
> input size:
> estimationB = (completedMapOutputSize / completedMaps) * totalMaps * 10
> Here, multiplying by 10 makes the estimation more conservative, so that a
> task is less likely to be assigned somewhere that is not big enough.
> The existing estimation goes like this:
> estimationA = (inputSize * completedMapOutputSize * 2.0) / completedMapInputSize
> My suggestion is to take the minimum of the two estimations:
> estimation = min(estimationA, estimationB)
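The two formulas quoted above can be sketched in Java as follows. This is only an illustration of the arithmetic in the description, not the actual ResourceEstimator code or the attached patch: the class name, method names, and parameter types are hypothetical, and the guards for a zero input size and for zero completed maps are assumptions added so that the sketch never divides by zero.

```java
// Hypothetical helper illustrating the proposal; not the real ResourceEstimator.
final class MapOutputEstimate {
    private MapOutputEstimate() {}

    // estimationA: scale the completed output by the ratio of total input
    // to completed input, with the 2.0 safety factor from the description.
    static double estimationA(long inputSize, long completedMapInputSize,
                              long completedMapOutputSize) {
        if (completedMapInputSize <= 0) {
            // Assumption: with no recorded input, A gives no usable bound.
            return Double.POSITIVE_INFINITY;
        }
        return (inputSize * (double) completedMapOutputSize * 2.0)
                / completedMapInputSize;
    }

    // estimationB: average output per completed map, times the total number
    // of maps, times the conservative factor of 10. Independent of input size.
    static double estimationB(long completedMapOutputSize, int completedMaps,
                              int totalMaps) {
        return ((double) completedMapOutputSize / completedMaps) * totalMaps * 10;
    }

    // Proposed estimate: the smaller of the two values.
    static long estimate(long inputSize, long completedMapInputSize,
                         long completedMapOutputSize, int completedMaps,
                         int totalMaps) {
        if (completedMaps <= 0) {
            // Assumption: nothing has completed yet, so there is no estimate.
            return 0L;
        }
        double a = estimationA(inputSize, completedMapInputSize,
                               completedMapOutputSize);
        double b = estimationB(completedMapOutputSize, completedMaps, totalMaps);
        return (long) Math.min(a, b);
    }
}
```

With made-up numbers resembling the Hive join case (ten completed maps that reported almost no input but produced 10 MB of output in total, out of 100 maps over 1 GB of input), estimationA is inflated by the tiny completed input size, while estimationB keeps the result bounded. When the input sizes are recorded correctly, estimationA is usually the smaller value and the result matches the existing behaviour.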
--
This message was sent by Atlassian JIRA
(v6.2#6252)