[
https://issues.apache.org/jira/browse/HBASE-13997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Enis Soztutar updated HBASE-13997:
----------------------------------
Attachment: hbase-13997_v2.patch
Thanks [~gzh1992n] for the patch. I was writing a unit test for this, but it
turns out that this part works and does not cause a client hang. The
off-by-one error is definitely there, but it does not cause a problem because
of a related but different issue.
Some time ago (HBASE-11564), the semantics of
{{ResultBoundedCompletionService}} changed from a blocking-queue kind of data
structure, where you submit multiple tasks and call {{take()}} multiple times,
into one where you submit multiple tasks and call {{take()}} only once. The
completed list is not cleaned when {{take()}} returns. HBASE-11564 made this
change in the Get code path, but apparently not in the scan code path.
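To make the "submit many, take once" semantics concrete, here is a minimal sketch. {{StickyCompletionService}} and {{StickyTakeDemo}} are hypothetical names for illustration only; this is not the HBase class, just a small stand-in that keeps its completed list intact when {{take()}} returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical stand-in for a "submit many, take once" completion service.
// NOT the HBase implementation; names and structure are assumptions.
class StickyCompletionService<V> {
  private final ExecutorService pool;
  private final List<Future<V>> completed = new ArrayList<>();

  StickyCompletionService(ExecutorService pool) {
    this.pool = pool;
  }

  void submit(Callable<V> task) {
    FutureTask<V> ft = new FutureTask<V>(task) {
      @Override
      protected void done() {
        // Runs when the task finishes, whether normally or with an exception.
        synchronized (completed) {
          completed.add(this);
          completed.notifyAll();
        }
      }
    };
    pool.execute(ft);
  }

  // Blocks until at least one task is done, then returns the first completed
  // future. The entry is not removed, so later calls return it again.
  Future<V> take() throws InterruptedException {
    synchronized (completed) {
      while (completed.isEmpty()) {
        completed.wait();
      }
      return completed.get(0);
    }
  }
}

class StickyTakeDemo {
  // Returns true if two consecutive take() calls hand back the same future.
  static boolean sameFutureTwice() {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      StickyCompletionService<String> cs = new StickyCompletionService<>(pool);
      cs.submit(() -> "replica-0");
      cs.submit(() -> "replica-1");
      Future<String> first = cs.take();
      Future<String> second = cs.take();
      return first == second;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("same future returned twice: " + sameFutureTwice());
  }
}
```

In this sketch both {{take()}} calls return the same future, because the first completed entry is never removed from the list. A queue-style service such as {{java.util.concurrent.ExecutorCompletionService}} would instead hand out each completed future exactly once.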
For example, we submit 3 calls to the {{ResultBoundedCompletionService}}, but
because of the off-by-one, {{submitted}} is 4. As soon as the first result
comes in, if it is an exception, we end up calling {{cs.take()}} 4 times, and
each call returns the same exception. This does not in fact cause a hang, but
the code still needs a cleanup.
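The difference between the two behaviours can be sketched as follows. This is an illustration under assumptions, not the HBase code: {{OffByOneDemo}} and its methods are hypothetical, the queue-style case uses the JDK's {{ExecutorCompletionService}} with {{poll()}} and a timeout in place of {{take()}} so the demo cannot hang, and the take-once case models {{cs.take()}} as always returning the same already-failed future.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of HBase.
class OffByOneDemo {

  // Queue semantics: each completed future is handed out exactly once.
  // Returns how many failures were counted before the loop got stuck.
  static int drainQueueStyle(int realTasks, int submitted) {
    ExecutorService pool = Executors.newFixedThreadPool(realTasks);
    CompletionService<String> cs = new ExecutorCompletionService<>(pool);
    for (int i = 0; i < realTasks; i++) {
      cs.submit(() -> { throw new IOException("replica failed"); });
    }
    int completed = 0;
    try {
      while (completed < submitted) {
        // poll() with a timeout stands in for take() to keep the demo finite.
        Future<String> f = cs.poll(1, TimeUnit.SECONDS);
        if (f == null) {
          break; // nothing left in the queue: take() would block forever here
        }
        try {
          f.get();
        } catch (ExecutionException e) {
          completed++;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      pool.shutdownNow();
    }
    return completed;
  }

  // Take-once semantics: the completed list is never cleaned, so every
  // take() returns the same failed future and the loop still terminates.
  static int drainStickyStyle(int submitted) {
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IOException("replica failed"));
    int completed = 0;
    while (completed < submitted) {
      Future<String> f = failed; // stands in for cs.take()
      try {
        f.get();
      } catch (ExecutionException e) {
        completed++;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return completed;
  }

  public static void main(String[] args) {
    System.out.println("queue style counted:  " + drainQueueStyle(3, 4));
    System.out.println("sticky style counted: " + drainStickyStyle(4));
  }
}
```

With 3 real tasks and {{submitted}} set to 4, the queue-style loop counts only 3 failures and then runs out of futures, which is where a real {{take()}} would block. The take-once loop reaches 4 by counting the same exception repeatedly, which matches the behaviour described above: no hang, but a misleading count.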
Attached v2 patch brings the scanner code path to be similar to the get code
path ({{RpcRetryingCallerWithReadReplicas}}). [~devaraj] do you mind taking a
look?
> ScannerCallableWithReplicas cause Infinitely blocking
> -----------------------------------------------------
>
> Key: HBASE-13997
> URL: https://issues.apache.org/jira/browse/HBASE-13997
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 1.0.1.1
> Reporter: Zephyr Guo
> Assignee: Zephyr Guo
> Priority: Minor
> Attachments: HBASE-13997.patch, hbase-13997_v2.patch
>
>
> Bug in ScannerCallableWithReplicas.addCallsForOtherReplicas method
> {code:title=code in ScannerCallableWithReplicas.addCallsForOtherReplicas|borderStyle=solid}
> private int addCallsForOtherReplicas(
> BoundedCompletionService<Pair<Result[], ScannerCallable>> cs,
> RegionLocations rl, int min,
> int max) {
> if (scan.getConsistency() == Consistency.STRONG) {
> return 0; // not scheduling on other replicas for strong consistency
> }
> for (int id = min; id <= max; id++) {
> if (currentScannerCallable.getHRegionInfo().getReplicaId() == id) {
> continue; //this was already scheduled earlier
> }
> ScannerCallable s =
> currentScannerCallable.getScannerCallableForReplica(id);
> if (this.lastResult != null) {
> s.getScan().setStartRow(this.lastResult.getRow());
> }
> outstandingCallables.add(s);
> RetryingRPC retryingOnReplica = new RetryingRPC(s);
> cs.submit(retryingOnReplica);
> }
> return max - min + 1; //bug? should be "max - min",because "continue"
> //always happen once
> }
> {code}
> This can leave completed < submitted permanently, so the following code
> blocks forever.
> {code:title=code in ScannerCallableWithReplicas.call|borderStyle=solid}
> // submitted larger than the actual one
> submitted += addCallsForOtherReplicas(cs, rl, 0, rl.size() - 1);
> try {
> //here will be affected
> while (completed < submitted) {
> try {
> Future<Pair<Result[], ScannerCallable>> f = cs.take();
> Pair<Result[], ScannerCallable> r = f.get();
> if (r != null && r.getSecond() != null) {
> updateCurrentlyServingReplica(r.getSecond(), r.getFirst(), done,
> pool);
> }
> return r == null ? null : r.getFirst(); // great we got an answer
> } catch (ExecutionException e) {
> // if not cancel or interrupt, wait until all RPC's are done
> // one of the tasks failed. Save the exception for later.
> if (exceptions == null) exceptions = new
> ArrayList<ExecutionException>(rl.size());
> exceptions.add(e);
> completed++;
> }
> }
> } catch (CancellationException e) {
> throw new InterruptedIOException(e.getMessage());
> } catch (InterruptedException e) {
> throw new InterruptedIOException(e.getMessage());
> } finally {
> // We get there because we were interrupted or because one or more of
> // the calls succeeded or failed. In all case, we stop all our tasks.
> cs.cancelAll(true);
> }
> {code}
> If every replica RS fails with an ExecutionException, the call blocks
> forever in cs.take().
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)