[
https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190717#comment-15190717
]
Anoop Sam John commented on HBASE-15436:
----------------------------------------
So you are saying that even after you see the log about the failure (after some
30+ minutes, in fact 36 minutes I guess, as the socket timeout seems to be 1
minute and there are 36 attempts), the flush still does not come out. After
seeing this log, how long did you wait?
So this is an async way of writing to the table. When the size of the
accumulated puts reaches the configured buffer size, we do a flush; until then
the puts are accumulated on the client side.
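For context, this is roughly what the client-side usage looks like (the table
name, column and the 2 MB buffer size below are made-up values for
illustration, not taken from this report):
{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         BufferedMutator mutator = connection.getBufferedMutator(
             new BufferedMutatorParams(TableName.valueOf("t1"))
                 .writeBufferSize(2 * 1024 * 1024))) {   // auto flush at ~2 MB
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
      mutator.mutate(put);   // buffered on the client side, not sent yet
      mutator.flush();       // explicit flush; this is the call that gets stuck
    }
  }
}
{code}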
I believe I have found the issue. This is not a deadlock as such.
To this flush we pass all the Rows to be flushed (written to the RS). By Rows I
mean Mutations.
It tries to group the mutations per server and contacts each of the servers
with the List of mutations destined for it.
Well, to do this grouping it checks the region location for each of the rows.
That results in a scan of META (as shown in the logs), and the scan fails. For
the 1st Mutation in this list alone it took 36 minutes, because the scan to
META is retried and each of the attempts fails only after the SocketTimeout.
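As an aside, the length of that retry window is governed by the usual client
settings; a sketch of tightening them (these are standard HBase client config
keys, but the values here are illustrative only and the defaults differ across
versions):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientTimeoutTuning {
  public static Configuration tightenedConf() {
    Configuration conf = HBaseConfiguration.create();
    // Fewer retries on region/META lookups (the default is much higher).
    conf.setInt("hbase.client.retries.number", 5);
    // Shorter per-RPC timeout so each failed attempt gives up faster (ms).
    conf.setInt("hbase.rpc.timeout", 10000);
    // Base pause between retries; the exponential backoff multiplies this (ms).
    conf.setLong("hbase.client.pause", 200);
    return conf;
  }
}
{code}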
See AsyncProcess#submit:
{code}
do {
  .......
  int posInList = -1;
  Iterator<? extends Row> it = rows.iterator();
  while (it.hasNext()) {
    Row r = it.next();
    HRegionLocation loc;
    try {
      if (r == null) throw new IllegalArgumentException("#" + id + ", row cannot be null");
      // Make sure we get 0-s replica.
      RegionLocations locs = connection.locateRegion(
          tableName, r.getRow(), true, true, RegionReplicaUtil.DEFAULT_REPLICA_ID);
      ........
    } catch (IOException ex) {
      locationErrors = new ArrayList<Exception>();
      locationErrorRows = new ArrayList<Integer>();
      LOG.error("Failed to get region location ", ex);
      // This action failed before creating ars. Retain it, but do not add to submit list.
      // We will then add it to ars in an already-failed state.
      retainedActions.add(new Action<Row>(r, ++posInList));
      locationErrors.add(ex);
      locationErrorRows.add(posInList);
      it.remove();
      break; // Backward compat: we stop considering actions on location error.
    }
    .........
  }
} while (retainedActions.isEmpty() && atLeastOne && (locationErrors == null));
{code}
The List 'rows' is the same List that BufferedMutatorImpl holds (i.e.
writeAsyncBuffer). So for the 1st Mutation the region location lookup failed,
and as you can see that Mutation also got removed from this List. It will
eventually be marked as a failed op. The flow then comes back to
BufferedMutatorImpl#backgroundFlushCommits, where we can see:
{code}
if (synchronous || ap.hasError()) {
  while (!writeAsyncBuffer.isEmpty()) {
    ap.submit(tableName, writeAsyncBuffer, true, null, false);
  }
{code}
The loop continues as long as writeAsyncBuffer is non-empty. So in these 36
minutes we managed to remove only one item from the list. It then goes on and
removes the 2nd, and so on. So if there are 100 Mutations in the list when we
call flush(), it would finish only after 36 * 100 minutes!
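Spelling out that back-of-the-envelope calculation (the 36 attempts and 1
minute per attempt are read off the logs in this scenario; the 100 buffered
Mutations is an assumed number):
{code}
public class FlushTimeEstimate {
  public static void main(String[] args) {
    int attemptsPerLookup = 36;   // META retries before one location lookup gives up
    int minutesPerAttempt = 1;    // each attempt dies on a SocketTimeout
    int bufferedMutations = 100;  // assumed entries sitting in writeAsyncBuffer

    int minutesPerMutation = attemptsPerLookup * minutesPerAttempt;  // 36 minutes
    int totalMinutes = bufferedMutations * minutesPerMutation;       // 3600 minutes
    System.out.println("flush() would return only after ~" + totalMinutes
        + " minutes (~60 hours)");
  }
}
{code}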
I do not know much about the design considerations of this AsyncProcess etc.
Maybe we should narrow the lock on the close() method down from method level
and set something like a closing state to true; the retries within these flows
should check that state and bail out early with a fat WARN log saying we will
lose some of the mutations applied so far. (?)
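A rough sketch of that idea (none of these names exist in HBase; it only
illustrates a closing flag that lets an in-flight retry loop bail out early):
{code}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Sketch only: a closing flag checked inside the retry loop.
class ClosingAwareRetrier {
  private static final Log LOG = LogFactory.getLog(ClosingAwareRetrier.class);
  private final AtomicBoolean closing = new AtomicBoolean(false);

  // Set from close() instead of holding a method-level lock for the whole close.
  void markClosing() {
    closing.set(true);
  }

  <T> T callWithRetries(Callable<T> rpc, int maxAttempts) throws IOException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (closing.get()) {
        LOG.warn("Client is closing; giving up retries early. "
            + "Mutations buffered but not yet flushed will be lost.");
        throw new IOException("Retries aborted: client is closing");
      }
      try {
        return rpc.call();
      } catch (Exception e) {
        LOG.warn("Attempt " + attempt + " failed, retrying", e);
      }
    }
    throw new IOException("Exhausted " + maxAttempts + " attempts");
  }
}
{code}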
> BufferedMutatorImpl.flush() appears to get stuck
> ------------------------------------------------
>
> Key: HBASE-15436
> URL: https://issues.apache.org/jira/browse/HBASE-15436
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 1.0.2
> Reporter: Sangjin Lee
> Attachments: hbaseException.log, threaddump.log
>
>
> We noticed an instance where the thread that was executing a flush
> ({{BufferedMutatorImpl.flush()}}) got stuck when the (local one-node) cluster
> shut down and was unable to get out of that stuck state.
> The setup is a single node HBase cluster, and apparently the cluster went
> away when the client was executing flush. The flush eventually logged a
> failure after 30+ minutes of retrying. That is understandable.
> What is unexpected is that the thread is stuck in this state (i.e. in the
> {{flush()}} call). I would have expected the {{flush()}} call to return after
> the complete failure.