[ 
https://issues.apache.org/jira/browse/TEZ-3912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531942#comment-16531942
 ] 

Kuhu Shukla commented on TEZ-3912:
----------------------------------

[~jeagles], thank you so much for the review. I think you have helped catch a 
bug: TEZ-3833 did not quite fix the problem we had. The InternalError was 
being wrapped as an IOException (IOE) anyway, so I am going to investigate 
why repopulating the remaining Map by srcAttemptID was not enough in the 
first place. 

As for this patch, wrapping an IOE in another IOE will break the shouldRetry 
logic for SocketTimeoutException. I will rework my patch based on the above 
observations.
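
To illustrate the wrapping concern, here is a minimal sketch. It is not the 
actual Tez Fetcher code: the class WrapDemo, the helper wrap, and this 
shouldRetry are hypothetical stand-ins, and the sketch assumes the retry 
check looks only at the exception's own type.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class WrapDemo {
    // Hypothetical stand-in for a type-based retry check; not the Tez implementation.
    static boolean shouldRetry(IOException e) {
        return e instanceof SocketTimeoutException;
    }

    // Wraps an IOException in a new IOException; the original becomes the cause.
    static IOException wrap(IOException e) {
        return new IOException("fetch failed", e);
    }

    public static void main(String[] args) {
        IOException timeout = new SocketTimeoutException("read timed out");
        // The unwrapped timeout is recognized by the type check.
        System.out.println(shouldRetry(timeout));
        // Once wrapped, the outer type is plain IOException, so the check fails.
        System.out.println(shouldRetry(wrap(timeout)));
    }
}
```

Under that assumption, a check of this kind would only see the timeout if the 
original exception is rethrown unwrapped or if the check also inspects 
getCause().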

> Fetchers should be more robust to corrupted inputs
> --------------------------------------------------
>
>                 Key: TEZ-3912
>                 URL: https://issues.apache.org/jira/browse/TEZ-3912
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jason Lowe
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3912.001.patch
>
>
> I recently saw a case where a bad node in the cluster produced corrupted 
> shuffle data that caused the codec to throw IllegalArgumentException when 
> trying to fetch.  Fetchers currently only handle IOException and 
> InternalError, and any other type of exception will cause the entire task to 
> be torn down.  We should consider catching Exception like MapReduce does to 
> be more robust in light of other types of errors coming from the codec and 
> allow retries to occur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
