[ https://issues.apache.org/jira/browse/TEZ-924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated TEZ-924: ------------------------------- Target Version/s: 0.8.0 (was: 0.7.0) > InputFailedEvent handling for Shuffle > ------------------------------------- > > Key: TEZ-924 > URL: https://issues.apache.org/jira/browse/TEZ-924 > Project: Apache Tez > Issue Type: Bug > Reporter: Siddharth Seth > Priority: Critical > > Shuffle receives batches of Events to process from the AM. The way these > events are sent over to the ShuffleHandlers and the way they're processed - > it's possible that Shuffle will start fetching data from an Event, which is > to be subsequently marked as failed (via an InputFailedEvent) > 1) The AM sends events in batches. An InputFailedEvent for a specific Input > may not be part of the same batch which contained the original event which is > being marked bad. > 2) The ShuffleEventHandler processes the events in each batch one event at a > time - so even if the InputFailedEvent follows - it's possible for Shuffle to > start fetching data from a Failed Input. > The AM needs to change to invalidate Inputs up front - so that related events > don't span batches. Alternately, it needs to apply the InputFailedEvent to > the original event being sent. > The Shuffle itself should process a batch update as a batch - that would > prevent fetchers from starting early even though there may be additional > events for the same host. -- This message was sent by Atlassian JIRA (v6.3.4#6332)