[ 
https://issues.apache.org/jira/browse/TEZ-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512053#comment-14512053
 ] 

Bikas Saha edited comment on TEZ-2314 at 4/24/15 11:53 PM:
-----------------------------------------------------------

Looking at this more, the real issue is that heartbeating and sending all these 
objects happen regardless of whether initialization is in progress. 
Synchronization alone will not result in correct data, e.g. sending 2 out of 3 
members is still possible. Besides, accessing these (and other future members) 
while initialization is in progress is fraught with errors. The patch changes 
the heartbeat code to check for initialization before sending such data. The 
heartbeat will still occur (or else long-running initialization will result in 
the task timing out on the AM liveness monitor); only the sending of the data 
is guarded. The patch also sends stats at the same frequency as counters. This 
should have been done earlier, since frequent updates for these could overload 
the AM (similar to the issue with counters).
[~rohini] Since your large jobs frequently repro this error, could you please 
check with this patch? Thanks!
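
For readers following along, the guard described above might look roughly like 
the sketch below. This is only an illustration under stated assumptions, not the 
code in the attached patch: the class GuardedHeartbeatSketch, its Heartbeat 
type, and the counterInterval parameter are all hypothetical names, and strings 
stand in for the real events, counters, and stats. The point it tries to show is 
that a heartbeat always goes out, but it carries no payload until initialization 
has finished, and stats ride along only on the heartbeats that also carry 
counters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names come from the Tez code base. */
class GuardedHeartbeatSketch {

    /** Payload of one heartbeat; a null field means "not sent this time". */
    static final class Heartbeat {
        final List<String> events;
        final String counters;
        final String stats;

        Heartbeat(List<String> events, String counters, String stats) {
            this.events = events;
            this.counters = counters;
            this.stats = stats;
        }
    }

    // Flipped once by the initializing thread when all members are set up.
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    // Assumption: counters and stats are sent on every N-th heartbeat.
    private final int counterInterval;
    private final List<String> pendingEvents = new ArrayList<>();
    private int heartbeatCount = 0;

    GuardedHeartbeatSketch(int counterInterval) {
        this.counterInterval = counterInterval;
    }

    void markInitialized() {
        initialized.set(true);
    }

    synchronized void addEvent(String event) {
        pendingEvents.add(event);
    }

    /** Always returns a heartbeat so the liveness check keeps passing. */
    synchronized Heartbeat nextHeartbeat() {
        heartbeatCount++;
        if (!initialized.get()) {
            // Initialization still running: liveness-only heartbeat.
            // Pending events stay queued and are not touched.
            return new Heartbeat(Collections.<String>emptyList(), null, null);
        }
        List<String> events = new ArrayList<>(pendingEvents);
        pendingEvents.clear();
        // Stats share the counters' schedule instead of going out every time.
        boolean sendCountersAndStats = heartbeatCount % counterInterval == 0;
        String counters = sendCountersAndStats ? "counters@" + heartbeatCount : null;
        String stats = sendCountersAndStats ? "stats@" + heartbeatCount : null;
        return new Heartbeat(events, counters, stats);
    }
}
```

Note that the lock in this sketch only keeps the event list consistent; what 
prevents a half-built payload from being sent is the initialized check, which 
matches the observation that synchronization by itself would not be enough.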


> Tez task attempt failures due to bad event serialization
> --------------------------------------------------------
>
>                 Key: TEZ-2314
>                 URL: https://issues.apache.org/jira/browse/TEZ-2314
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Rohini Palaniswamy
>            Assignee: Bikas Saha
>            Priority: Blocker
>         Attachments: TEZ-2314.1.patch, TEZ-2314.log.patch
>
>
> {code}
> 2015-04-13 19:21:48,516 WARN [Socket Reader #3 for port 53530] ipc.Server: Unable to read call parameters for client 10.216.13.112on connection protocol org.apache.tez.common.TezTaskUmbilicalProtocol for rpcKind RPC_WRITABLE
> java.lang.ArrayIndexOutOfBoundsException: 1935896432
>         at org.apache.tez.runtime.api.impl.EventMetaData.readFields(EventMetaData.java:120)
>         at org.apache.tez.runtime.api.impl.TezEvent.readFields(TezEvent.java:271)
>         at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.readFields(TezHeartbeatRequest.java:110)
>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
>         at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.readFields(WritableRpcEngine.java:160)
>         at org.apache.hadoop.ipc.Server$Connection.processRpcRequest(Server.java:1884)
>         at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1816)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1574)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:806)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:673)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:644)
> {code}
> cc/ [~hitesh] and [~bikassaha]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
