[ https://issues.apache.org/jira/browse/MAPREDUCE-6003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chengbing Liu updated MAPREDUCE-6003:
-------------------------------------
    Attachment: MAPREDUCE-6003.patch

> Resource Estimator suggests huge map output in some cases
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-6003
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6003
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1.2.1
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>         Attachments: MAPREDUCE-6003.patch
>
>
> In some cases, ResourceEstimator can return a map output estimate that is
> far too large. This happens when the input size is not calculated correctly.
> A typical case is joining two Hive tables, one in HDFS and the other in
> HBase. The maps that process the HBase table finish first, and they report
> an input length of 0 because of TableInputFormat. Then, for a map that
> processes the HDFS table, the estimated output size is very large because
> the completed input size is wrong, so the map task cannot be assigned.
> There are two possible solutions to this problem:
> (1) Make the input size correct for each case, e.g. HBase, etc.
> (2) Use another algorithm to estimate the map output, or at least make the
> estimate closer to reality.
> I prefer the second way, since the first would require every kind of input
> to be handled. That is not easy for some inputs, such as URIs.
> In my opinion, we could make a second estimation that is independent of the
> input size:
>     estimationB = (completedMapOutputSize / completedMaps) * totalMaps * 10
> Here, multiplying by 10 makes the estimation more conservative, so that a
> task is less likely to be assigned somewhere that is not big enough.
> The existing estimation goes like this:
>     estimationA = (inputSize * completedMapOutputSize * 2.0) / completedMapInputSize
> My suggestion is to take the minimum of the two estimations:
>     estimation = min(estimationA, estimationB)
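
The two estimations described in the issue can be sketched in Java as follows. This is only an illustration of the formulas as written in the report, not the code in MAPREDUCE-6003.patch or in Hadoop's ResourceEstimator. The class name, the method signatures, and the rejection of non-positive denominators are all assumptions made for the example.

```java
// Hypothetical sketch of the formulas quoted in the issue; not Hadoop source code.
final class MapOutputEstimateSketch {

    // Existing estimation: scales the observed output/input ratio by the job input size.
    static long estimationA(long inputSize, long completedMapOutputSize,
                            long completedMapInputSize) {
        if (completedMapInputSize <= 0) {
            // Assumption: the caller guarantees a positive denominator.
            throw new IllegalArgumentException("completedMapInputSize must be positive");
        }
        // Multiply in double so the intermediate product does not overflow a long.
        double estimate = (inputSize * (double) completedMapOutputSize * 2.0)
                / completedMapInputSize;
        return Math.round(estimate);
    }

    // Proposed estimation: average output per completed map, times total maps, times 10.
    static long estimationB(long completedMapOutputSize, long completedMaps, long totalMaps) {
        if (completedMaps <= 0) {
            // Assumption: the estimate is only requested after some maps have completed.
            throw new IllegalArgumentException("completedMaps must be positive");
        }
        double averageOutput = completedMapOutputSize / (double) completedMaps;
        return Math.round(averageOutput * totalMaps * 10);
    }

    // Suggested combination: the smaller of the two estimations.
    static long estimation(long inputSize, long completedMapOutputSize,
                           long completedMapInputSize, long completedMaps, long totalMaps) {
        return Math.min(
                estimationA(inputSize, completedMapOutputSize, completedMapInputSize),
                estimationB(completedMapOutputSize, completedMaps, totalMaps));
    }
}
```

With made-up numbers that mimic the HBase case (five completed maps that wrote 500,000,000 bytes in total but reported a combined input of only 5 bytes, a 10,000,000,000-byte job input, and 100 maps in total), estimationA is dominated by the tiny completed input size, while estimationB stays tied to the observed average output per map, so the minimum picks estimationB.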