[
https://issues.apache.org/jira/browse/TEZ-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512053#comment-14512053
]
Bikas Saha edited comment on TEZ-2314 at 4/24/15 11:54 PM:
-----------------------------------------------------------
Looking at this more, the real issue is that the heartbeat sends all of these
objects (counters, stats, etc.) regardless of whether initialization is in
progress. Synchronization alone will not produce correct data: for example,
sending 2 out of 3 members is still possible. Besides, accessing these (and
other future members) while initialization is in progress is error-prone.
I am changing the heartbeat code to check for initialization before sending
such data. The heartbeat will still occur (otherwise a long-running
initialization would cause the task to time out on the AM liveness monitor),
but sending the data is guarded. I am also sending stats at the same frequency
as counters. This should have been done earlier, since frequent updates for
these could overload the AM (similar to the issue with counters).
[~rohini] Since your large jobs frequently repro this error, could you please
check with this patch? Thanks!
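To illustrate the approach described above, here is a minimal sketch and not the actual patch. The class GuardedHeartbeat, its Request payload, and the dataInterval parameter are hypothetical names invented for this example; the real Tez task reporter is structured differently. The sketch assumes a single heartbeat thread and shows two things: the heartbeat is always produced so that liveness is preserved, and counters and stats are attached only after initialization has finished, and then only on every Nth heartbeat.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch, not Tez code: heartbeat that always fires but guards its data. */
final class GuardedHeartbeat {

    /** Hypothetical payload; a null field means "not sent in this heartbeat". */
    static final class Request {
        final String counters;
        final String stats;

        Request(String counters, String stats) {
            this.counters = counters;
            this.stats = stats;
        }
    }

    // Set once by the initializing thread, read by the heartbeat thread.
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    // Counters and stats share one interval, so both go out at the same frequency.
    private final int dataInterval;
    // Touched only by the heartbeat thread (assumption of this sketch).
    private int heartbeatCount = 0;

    GuardedHeartbeat(int dataInterval) {
        if (dataInterval < 1) {
            throw new IllegalArgumentException("dataInterval must be >= 1");
        }
        this.dataInterval = dataInterval;
    }

    /** Called by the task once initialization has completed. */
    void markInitialized() {
        initialized.set(true);
    }

    /** Builds the next heartbeat request; a request is returned on every call. */
    Request nextRequest(String counters, String stats) {
        heartbeatCount++;
        if (!initialized.get()) {
            // Liveness-only heartbeat: members may be half-built, so none are read.
            return new Request(null, null);
        }
        boolean sendData = heartbeatCount % dataInterval == 0;
        return new Request(sendData ? counters : null, sendData ? stats : null);
    }
}
```

With dataInterval set to 2, a call made before markInitialized() yields a request with no data, and after initialization every second heartbeat carries both counters and stats. The point of the guard is that the sender never reads the members while they may be partially constructed, which locking around individual members would not ensure.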
> Tez task attempt failures due to bad event serialization
> --------------------------------------------------------
>
> Key: TEZ-2314
> URL: https://issues.apache.org/jira/browse/TEZ-2314
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Rohini Palaniswamy
> Assignee: Bikas Saha
> Priority: Blocker
> Attachments: TEZ-2314.1.patch, TEZ-2314.log.patch
>
>
> {code}
> 2015-04-13 19:21:48,516 WARN [Socket Reader #3 for port 53530] ipc.Server:
> Unable to read call parameters for client 10.216.13.112on connection protocol
> org.apache.tez.common.TezTaskUmbilicalProtocol for rpcKind RPC_WRITABLE
> java.lang.ArrayIndexOutOfBoundsException: 1935896432
> at org.apache.tez.runtime.api.impl.EventMetaData.readFields(EventMetaData.java:120)
> at org.apache.tez.runtime.api.impl.TezEvent.readFields(TezEvent.java:271)
> at org.apache.tez.runtime.api.impl.TezHeartbeatRequest.readFields(TezHeartbeatRequest.java:110)
> at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:285)
> at org.apache.hadoop.ipc.WritableRpcEngine$Invocation.readFields(WritableRpcEngine.java:160)
> at org.apache.hadoop.ipc.Server$Connection.processRpcRequest(Server.java:1884)
> at org.apache.hadoop.ipc.Server$Connection.processOneRpc(Server.java:1816)
> at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1574)
> at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:806)
> at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:673)
> at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:644)
> {code}
> cc/ [~hitesh] and [~bikassaha]
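A side observation on the index in the trace above, offered as a sketch and not as a confirmed diagnosis: the value 1935896432 is 0x73636F70, and its four big-endian bytes are all printable ASCII. That would be consistent with the reader consuming text bytes at a position where it expected an int, which is what a misaligned or partially written serialized stream tends to produce. The helper class IndexBytes below is hypothetical and exists only to show the decoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tez or Hadoop. */
final class IndexBytes {

    /** Returns the four big-endian bytes of value, interpreted as ASCII text. */
    static String asAscii(int value) {
        // ByteBuffer is big-endian by default, the same byte order DataInput.readInt uses.
        byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(asAscii(1935896432)); // prints "scop"
    }
}
```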