[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197363#comment-16197363
 ] 

rangjiaheng commented on MAPREDUCE-6978:
----------------------------------------

The root cause is that a container writes each counter to RPC as the ordinal 
of its TaskCounter enum value. When the AM reads an ordinal that is greater 
than or equal to the number of values in its own version of the TaskCounter 
enum, it throws an out-of-bounds exception, and the AM then kills the container.

{code:java}
  public void readFields(DataInput in) throws IOException {
    clear();
    int len = WritableUtils.readVInt(in);
    T[] enums = enumClass.getEnumConstants();
    for (int i = 0; i < len; ++i) {
      int ord = WritableUtils.readVInt(in);
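      // enums[ord] throws ArrayIndexOutOfBoundsException when the writer's
      // enum has more constants than the reader's (ord >= enums.length).
      Counter counter = newCounter(enums[ord]);
      counter.setValue(WritableUtils.readVLong(in));
      counters[ord] = counter;
    }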
  }
{code}

This problem happened while we were doing a gray release. I believe it would 
not happen if we upgraded all the NMs simultaneously; however, we prefer gray 
releases.
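
A minimal sketch of one possible mitigation, not the actual Hadoop code or the 
fix adopted for this issue: the reader always consumes the value, but ignores 
any ordinal that its own enum version does not know. Everything here is a 
hypothetical stand-in: the class TolerantCounterReader, the two enums, and the 
constant NEW_COUNTER are made up for illustration, and plain 
writeInt/writeLong replace the WritableUtils variable-length encoding so that 
the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for the counter group; not Hadoop code. */
class TolerantCounterReader {
  /** The old node's view of the enum. */
  enum OldTaskCounter { MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS }

  /** The new node's view: one constant appended (NEW_COUNTER is made up). */
  enum NewTaskCounter { MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS, NEW_COUNTER }

  /** Writes a count followed by (ordinal, value) pairs, one per enum constant. */
  static <T extends Enum<T>> byte[] write(Class<T> enumClass, long[] values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      T[] enums = enumClass.getEnumConstants();
      out.writeInt(enums.length);
      for (T e : enums) {
        out.writeInt(e.ordinal());
        out.writeLong(values[e.ordinal()]);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads the pairs back, skipping ordinals unknown to this enum version. */
  static <T extends Enum<T>> long[] read(Class<T> enumClass, byte[] wire) {
    T[] enums = enumClass.getEnumConstants();
    long[] counters = new long[enums.length];
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
      int len = in.readInt();
      for (int i = 0; i < len; ++i) {
        int ord = in.readInt();
        // Always consume the value so the stream stays aligned.
        long value = in.readLong();
        if (ord < 0 || ord >= enums.length) {
          continue; // Counter added by a newer version: ignore it.
        }
        counters[ord] = value;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return counters;
  }
}
```

With this sketch, bytes written with NewTaskCounter can be read with 
OldTaskCounter: the two shared counters are kept and the appended one is 
dropped. This only helps when new constants are appended at the end of the 
enum; reordering or removing constants would still silently mismatch values.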


> MR task counters deserialized through RPC throws OutOfBoundsException if 
> Counter enum class version not match
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6978
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6978
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, task
>    Affects Versions: 3.0.0-alpha4
>         Environment: NM1 TaskCounter.class old version; 
> NM2 TaskCounter.class new version (new Enumeration values appended); 
>            Reporter: rangjiaheng
>
> Environment:
> NM1 TaskCounter.class old version; 
> NM2 TaskCounter.class new version (new Enumeration values appended); 
> Result:
> When an MR app's AM runs on NM1 and its containers run on NM2, all the 
> containers on NM2 fail, and the AM throws an OutOfBoundsException.
> Reason:
> While the app is running, containers report their counters to the AM through 
> RPC. A container with the new version of TaskCounter.class writes more 
> counter values to RPC, but the AM with the old version of TaskCounter.class 
> cannot read them correctly.



