[
https://issues.apache.org/jira/browse/TEZ-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619668#comment-16619668
]
Jonathan Eagles edited comment on TEZ-3990 at 9/18/18 8:26 PM:
---------------------------------------------------------------
[~kshukla],
I think this is a good start to this patch. A couple of improvements I'd like
to discuss.
- Instead of reading the configuration each time penalizeHost is called, please
initialize this value in the constructor.
- Default limit reduction. 5 hours seems too long since this represents 35
failures and failures have persisted for 24 hours at this point. How does 10
minute limit sound?
Beyond the scope of this JIRA.
- Separate the delay calculation from the delay container so that it can be
tested separately.
- Switch from System.currentTimeMillis to monotonic clock. Since clock skew can
cause the penalty queue to miscalculate await time.
- User provided clock for testability. Instead of racing to check the delay
queue before the penalties are removed, we can control the clock so that we
know exactly what is in the delay queue
- I would also replace the compareTo method of Penalty with
Long.compare(long,long) to not have to duplicate the logic.
was (Author: jeagles):
[~kshukla],
I think this is a good start to this patch. A couple of improvements I'd like
to discuss.
- Instead of reading the configuration each time penalizeHost is called, please
initialize this value in the constructor.
- Default limit reduction. 5 hours seems too long since this represents 35
failures and failures have persisted for 24 hours at this point. How does 10
minute limit sound?
Beyond the scope of this JIRA.
- Separate the delay calculation from the delay container so that it can be
tested separately.
- Switch from System.currentTimeMillis to monotonic clock. Since clock skew can
cause the penalty queue to miscalculate await time.
- User provided clock for testability. Instead of racing to check the delay
queue before the penalties are removed, we can control the clock so that we
know exactly what is in the delay queue
> The number of shuffle penalties for a host/inputAttemptIdentifier should be
> capped
> ----------------------------------------------------------------------------------
>
> Key: TEZ-3990
> URL: https://issues.apache.org/jira/browse/TEZ-3990
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.9.1, 0.10.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-3990.001.patch, TEZ-3990.002.patch,
> TEZ-3990.003.patch
>
>
> In a scenario where the same mapId fetches fail, the penalty code allows
> adding the same Host/InputAttemptIdentifier over and over with revised
> penalty time that grows exponentially. It should at some point drop the
> retrying and report failure to the AM asap to allow the job to rectify the
> upstream output.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)