[ 
https://issues.apache.org/jira/browse/TEZ-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963234#comment-15963234
 ] 

Jonathan Eagles commented on TEZ-3685:
--------------------------------------

[~sseth]
- The unordered and ordered case did have an off by one issue. Fixed this up.
- The unordered shuffle manager uses a mix Reentrant locks and synchronization 
and causes a difficulty in avoiding deadlocks scenarios. Moved totally to using 
reentrant locks. Re-worked the locking/unlocking to keep correct 
synchronization.
- The srcAttemptRemaining verify sanity check was found to be redundant, since 
the inputAttemptIdentifier has already been verified by looking up the 
identifier in the pathToAttemptMap that is generated from the 
srcAttemptsRemaining.

> ShuffleHandler completedInputSet off-by-one error
> -------------------------------------------------
>
>                 Key: TEZ-3685
>                 URL: https://issues.apache.org/jira/browse/TEZ-3685
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-3685.1.patch
>
>
> Per comment in TEZ-3334 
> https://issues.apache.org/jira/browse/TEZ-3334?focusedCommentId=15950550&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15950550
> Addressing The off by 1 issue / other correctness issues in this jira
> - ShuffleManager: has moved to using a BitSet instead of a ConcurrentSet. May 
> need to look at the synchronization points required for the bitset. (access 
> to the map had some synchronized and some non synchronized access)
> - Why was this removed?
> {code}
> -    // Sanity check
> -    // we are guaranteed that key is not null
> -    if (srcAttemptsRemaining.get(srcAttemptId.toString()) == null) {
> -      // wrongMapErrs.increment(1);
> -      LOG.warn("Invalid input. Received output for headerPathComponent: "
> -          + pathComponent + "nextRemainingSrcAttemptId: "
> -          + getNextRemainingAttempt() + ", mappedSrcAttemptId: " + 
> srcAttemptId);
> -      return false;
> -    }
> {code}
> - 
> Is there an off by 1 error here? Specifically. input.getInputIdentifier() + 
> ((CompositeInputAttemptIdentifier) input).getInputIdentifierCount(). Think 
> getInputIdentifierCount is always at least 1.
> {code}
> +      if(input instanceof CompositeInputAttemptIdentifier) {
> +        // LLL Is there an off by 0 case here. i.e. getInputIdentiferCount 
> will be set to 1 for a regular event.
> +        // We start scanning the bitst early. Also depends on how 
> completedInputSet is setup.
> +        if (completedInputSet.nextClearBit(input.getInputIdentifier()) >=
> +            input.getInputIdentifier() + ((CompositeInputAttemptIdentifier) 
> input).getInputIdentifierCount()) {
> +          inputIter.remove();
> +          continue;
> +        }
> +      } else {
> +        if (completedInputSet.get(input.getInputIdentifier())) {
> +          inputIter.remove();
> +          continue;
> +        }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to