[jira] [Updated] (HBASE-13997) ScannerCallableWithReplicas cause Infinitely blocking

Enis Soztutar (JIRA) Wed, 08 Jul 2015 17:37:34 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Enis Soztutar updated HBASE-13997:
----------------------------------
    Attachment: hbase-13997_v2.patch

Thanks [~gzh1992n] for the patch. I was writing a unit test for this, but it 
turns out that this part works and does not cause a client hang. The 
off-by-one error is definitely there, but it does not cause a problem because 
of a related but different issue. 

Some time ago (HBASE-11564), the semantics of 
{{ResultBoundedCompletionService}} changed from a blocking-queue-like data 
structure, where you submit multiple tasks and call {{take()}} multiple times, 
into one where you submit multiple tasks and take only once. The completed 
list is not cleaned up when {{take()}} returns. It seems HBASE-11564 made the 
change in the Get code path, but not in the scan code path. 
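
For contrast, the queue-style behaviour can be seen in the JDK's {{ExecutorCompletionService}}, which is used here only as a stand-in and is not the HBase class. This is a minimal sketch; the class and method names ({{QueueStyleSketch}}, {{takeAll}}) are made up for illustration. Each {{take()}} removes one completed future, so taking more times than you submitted would block.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustration only: JDK completion service, not ResultBoundedCompletionService.
class QueueStyleSketch {
    /** Submits n tasks, takes n times, and returns the distinct results seen. */
    static Set<Integer> takeAll(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            ExecutorCompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
            for (int i = 0; i < n; i++) {
                final int id = i;
                cs.submit(() -> id);
            }
            Set<Integer> seen = new HashSet<>();
            for (int i = 0; i < n; i++) {
                // Each take() retrieves and removes one completed future.
                seen.add(cs.take().get());
            }
            // Nothing is left now; one more take() would block forever,
            // so poll() is used to show that the queue is empty.
            if (cs.poll() != null) {
                throw new IllegalStateException("unexpected extra result");
            }
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(takeAll(3).size());
    }
}
```

With these semantics, a counter that is one too high would indeed hang on the extra {{take()}}, which is the scenario described in the report.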

For example, we submit 3 calls to the {{ResultBoundedCompletionService}}, but 
because of the off-by-one, {{submitted}} is 4. If the first result to come in 
is an exception, we call {{cs.take()}} 4 times, and each call returns the same 
exception. This does not in fact cause a hang, but the code still needs a 
cleanup. 
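
The behaviour described above can be sketched roughly as follows. This is a toy model, not the HBase code: {{TakeOnceResults}} and {{drain}} are hypothetical names, and the holder only imitates the "completed list is never cleaned" property. It assumes one failure has already arrived and the counter says 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;

// Illustration only: a made-up model of take-once semantics.
class TakeOnceSketch {
    /** Hypothetical holder that remembers completed outcomes and never removes them. */
    static final class TakeOnceResults {
        private final List<ExecutionException> completed = new ArrayList<>();

        synchronized void complete(ExecutionException failure) {
            completed.add(failure);
            notifyAll();
        }

        /** Blocks until one outcome exists, then returns the first one on every call. */
        synchronized ExecutionException take() throws InterruptedException {
            while (completed.isEmpty()) {
                wait();
            }
            return completed.get(0);
        }
    }

    /** Same shape as the scan loop: keep taking while completed < submitted. */
    static List<ExecutionException> drain(TakeOnceResults cs, int submitted) {
        List<ExecutionException> exceptions = new ArrayList<>();
        int completed = 0;
        try {
            while (completed < submitted) {
                exceptions.add(cs.take());
                completed++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return exceptions;
    }

    public static void main(String[] args) {
        TakeOnceResults cs = new TakeOnceResults();
        cs.complete(new ExecutionException(new RuntimeException("replica 0 failed")));
        // Three real calls, but the off-by-one counter says four.
        List<ExecutionException> seen = drain(cs, 4);
        boolean allSame = seen.stream().allMatch(e -> e == seen.get(0));
        System.out.println(seen.size() + " takes, same exception each time: " + allSame);
    }
}
```

The loop terminates because every {{take()}} is satisfied by the one stored failure, so the inflated counter is reached without waiting on a call that was never submitted.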

The attached v2 patch brings the scanner code path in line with the Get code 
path ({{RpcRetryingCallerWithReadReplicas}}). [~devaraj] do you mind taking a 
look? 



> ScannerCallableWithReplicas cause Infinitely blocking
> -----------------------------------------------------
>
>                 Key: HBASE-13997
>                 URL: https://issues.apache.org/jira/browse/HBASE-13997
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.0.1.1
>            Reporter: Zephyr Guo
>            Assignee: Zephyr Guo
>            Priority: Minor
>         Attachments: HBASE-13997.patch, hbase-13997_v2.patch
>
>
> Bug in ScannerCallableWithReplicas.addCallsForOtherReplicas method  
> {code:title=code in ScannerCallableWithReplicas.addCallsForOtherReplicas|borderStyle=solid}
> private int addCallsForOtherReplicas(
>       BoundedCompletionService<Pair<Result[], ScannerCallable>> cs,
>       RegionLocations rl, int min, int max) {
>     if (scan.getConsistency() == Consistency.STRONG) {
>       return 0; // not scheduling on other replicas for strong consistency
>     }
>     for (int id = min; id <= max; id++) {
>       if (currentScannerCallable.getHRegionInfo().getReplicaId() == id) {
>         continue; //this was already scheduled earlier
>       }
>       ScannerCallable s = currentScannerCallable.getScannerCallableForReplica(id);
>       if (this.lastResult != null) {
>         s.getScan().setStartRow(this.lastResult.getRow());
>       }
>       outstandingCallables.add(s);
>       RetryingRPC retryingOnReplica = new RetryingRPC(s);
>       cs.submit(retryingOnReplica);
>     }
>     return max - min + 1;     //bug? should be "max - min",because "continue"
>                                         //always happen once
>   }
> {code}
> This can leave completed < submitted forever, so the following code blocks 
> indefinitely.
> {code:title=code in ScannerCallableWithReplicas.call|borderStyle=solid}
> // submitted larger than the actual one
>  submitted += addCallsForOtherReplicas(cs, rl, 0, rl.size() - 1);
>     try {
>       //here will be affected
>       while (completed < submitted) {
>         try {
>           Future<Pair<Result[], ScannerCallable>> f = cs.take();
>           Pair<Result[], ScannerCallable> r = f.get();
>           if (r != null && r.getSecond() != null) {
>             updateCurrentlyServingReplica(r.getSecond(), r.getFirst(), done, pool);
>           }
>           return r == null ? null : r.getFirst(); // great we got an answer
>         } catch (ExecutionException e) {
>           // if not cancel or interrupt, wait until all RPC's are done
>           // one of the tasks failed. Save the exception for later.
>           if (exceptions == null) exceptions = new ArrayList<ExecutionException>(rl.size());
>           exceptions.add(e);
>           completed++;
>         }
>       }
>     } catch (CancellationException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } catch (InterruptedException e) {
>       throw new InterruptedIOException(e.getMessage());
>     } finally {
>       // We get there because we were interrupted or because one or more of the
>       // calls succeeded or failed. In all case, we stop all our tasks.
>       cs.cancelAll(true);
>     }
> {code}
> If the calls to all replica region servers fail with ExecutionException, the 
> loop blocks indefinitely in cs.take().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
