[
https://issues.apache.org/jira/browse/TEZ-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963234#comment-15963234
]
Jonathan Eagles commented on TEZ-3685:
--------------------------------------
[~sseth]
- The unordered and ordered case did have an off by one issue. Fixed this up.
- The unordered shuffle manager uses a mix Reentrant locks and synchronization
and causes a difficulty in avoiding deadlocks scenarios. Moved totally to using
reentrant locks. Re-worked the locking/unlocking to keep correct
synchronization.
- The srcAttemptRemaining verify sanity check was found to be redundant, since
the inputAttemptIdentifier has already been verified by looking up the
identifier in the pathToAttemptMap that is generated from the
srcAttemptsRemaining.
> ShuffleHandler completedInputSet off-by-one error
> -------------------------------------------------
>
> Key: TEZ-3685
> URL: https://issues.apache.org/jira/browse/TEZ-3685
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Attachments: TEZ-3685.1.patch
>
>
> Per comment in TEZ-3334
> https://issues.apache.org/jira/browse/TEZ-3334?focusedCommentId=15950550&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15950550
> Addressing The off by 1 issue / other correctness issues in this jira
> - ShuffleManager: has moved to using a BitSet instead of a ConcurrentSet. May
> need to look at the synchronization points required for the bitset. (access
> to the map had some synchronized and some non synchronized access)
> - Why was this removed?
> {code}
> - // Sanity check
> - // we are guaranteed that key is not null
> - if (srcAttemptsRemaining.get(srcAttemptId.toString()) == null) {
> - // wrongMapErrs.increment(1);
> - LOG.warn("Invalid input. Received output for headerPathComponent: "
> - + pathComponent + "nextRemainingSrcAttemptId: "
> - + getNextRemainingAttempt() + ", mappedSrcAttemptId: " +
> srcAttemptId);
> - return false;
> - }
> {code}
> -
> Is there an off by 1 error here? Specifically. input.getInputIdentifier() +
> ((CompositeInputAttemptIdentifier) input).getInputIdentifierCount(). Think
> getInputIdentifierCount is always at least 1.
> {code}
> + if(input instanceof CompositeInputAttemptIdentifier) {
> + // LLL Is there an off by 0 case here. i.e. getInputIdentiferCount
> will be set to 1 for a regular event.
> + // We start scanning the bitst early. Also depends on how
> completedInputSet is setup.
> + if (completedInputSet.nextClearBit(input.getInputIdentifier()) >=
> + input.getInputIdentifier() + ((CompositeInputAttemptIdentifier)
> input).getInputIdentifierCount()) {
> + inputIter.remove();
> + continue;
> + }
> + } else {
> + if (completedInputSet.get(input.getInputIdentifier())) {
> + inputIter.remove();
> + continue;
> + }
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)