[ https://issues.apache.org/jira/browse/HBASE-15436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190717#comment-15190717 ]

Anoop Sam John commented on HBASE-15436:
----------------------------------------

So you are saying that even after you see the log about the failure (after some 
30+ minutes, in fact 36 minutes I guess, as the socket timeout seems to be 1 
minute and there were 36 attempts), the flush still does not come out. After 
seeing this log, how long did you wait?
So this is an asynchronous way of writing to the table. Yes, when the size of 
the accumulated puts reaches some configured size we do a flush; till then the 
puts are accumulated on the client side.
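For context, a minimal sketch of the buffered-write pattern being described (the table name, column names and buffer size below are illustrative values only, not taken from the reporter's setup):
{code}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("t1"))
          .writeBufferSize(2 * 1024 * 1024); // puts accumulate client side until this size
      try (BufferedMutator mutator = connection.getBufferedMutator(params)) {
        Put put = new Put(Bytes.toBytes("row1"));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        mutator.mutate(put); // buffered only; nothing is sent yet
        mutator.flush();     // explicit flush; this is the call reported as stuck
      }
    }
  }
}
{code}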
I believe I found the issue. This is not a deadlock as such.
To this flush we pass all the Rows to be flushed (written to the RS). By Rows I 
mean Mutations.
It will try to group the mutations per server and contact each server with the 
List of mutations destined for it.
To do this grouping it checks the region location for each row, and that scan 
against META (as shown in the logs) fails. For the 1st Mutation in this list 
alone it took 36 minutes, because the scan to META has retries and each attempt 
fails only after the SocketTimeout.

See AsyncProcess#submit:
{code}
do {
      .......
      int posInList = -1;
      Iterator<? extends Row> it = rows.iterator();
      while (it.hasNext()) {
        Row r = it.next();
        HRegionLocation loc;
        try {
          if (r == null) throw new IllegalArgumentException("#" + id + ", row cannot be null");
          // Make sure we get 0-s replica.
          RegionLocations locs = connection.locateRegion(
              tableName, r.getRow(), true, true, RegionReplicaUtil.DEFAULT_REPLICA_ID);
          ........
        } catch (IOException ex) {
          locationErrors = new ArrayList<Exception>();
          locationErrorRows = new ArrayList<Integer>();
          LOG.error("Failed to get region location ", ex);
          // This action failed before creating ars. Retain it, but do not add to submit list.
          // We will then add it to ars in an already-failed state.
          retainedActions.add(new Action<Row>(r, ++posInList));
          locationErrors.add(ex);
          locationErrorRows.add(posInList);
          it.remove();
          break; // Backward compat: we stop considering actions on location error.
        }

       .........
      }
    } while (retainedActions.isEmpty() && atLeastOne && (locationErrors == null));
{code}
The List 'rows' is the same List which BufferedMutatorImpl holds (i.e. 
writeAsyncBuffer). So for the 1st Mutation the region location lookup failed 
and, as you can see, that Mutation also got removed from this List. It will 
eventually be marked as a failed op, and the flow comes back to 
BufferedMutatorImpl#backgroundFlushCommits.
Here we can see:
{code}
if (synchronous || ap.hasError()) {
  while (!writeAsyncBuffer.isEmpty()) {
    ap.submit(tableName, writeAsyncBuffer, true, null, false);
  }
  .......
}
{code}
That loop keeps running as long as writeAsyncBuffer is non-empty. So in those 
36 minutes we could remove only one item from the list. Then it goes on and 
removes the 2nd, and so on. So if there are 100 Mutations in the list when we 
called flush(), it would finish only after 36 * 100 minutes!
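To make the arithmetic explicit, a back-of-envelope sketch (the retry count, timeout and buffer size are the guesses above, not verified configuration values):
{code}
// Back-of-envelope worst case for this failure mode (illustrative numbers only).
public class FlushWorstCase {
  public static void main(String[] args) {
    int metaRetries = 36;        // retries of the META scan, guessed from the log timing
    int timeoutMinutes = 1;      // each attempt dies on a ~1 minute socket timeout
    int bufferedMutations = 100; // mutations sitting in writeAsyncBuffer at flush() time

    int minutesPerMutation = metaRetries * timeoutMinutes;     // ~36 mins to fail one row
    int totalMinutes = minutesPerMutation * bufferedMutations; // ~3600 mins, about 2.5 days
    System.out.println("Worst-case flush duration: " + totalMinutes + " minutes");
  }
}
{code}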

I am not too familiar with the design considerations of this AsyncProcess etc. 
Maybe we should narrow the lock on close() down from the method level and set 
something like a 'closing' state to true; the retries within these flows should 
check that state and bail out early with a fat WARN log saying we will lose 
some of the mutations applied till now. (?)
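For illustration, the kind of early-out I have in mind could look roughly like this (the 'closing' flag and the place it is checked are hypothetical, not existing code in AsyncProcess):
{code}
// Hypothetical sketch only: this flag and check do not exist in the current code.
private volatile boolean closing = false;   // flipped as soon as close() starts

// Inside the retry loops of the location lookup / submit path:
if (closing) {
  LOG.warn("Client is closing; giving up retries early. Some of the mutations applied "
      + "so far will be lost.");
  throw new InterruptedIOException("Aborting retries because the client is closing");
}
{code}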

> BufferedMutatorImpl.flush() appears to get stuck
> ------------------------------------------------
>
>                 Key: HBASE-15436
>                 URL: https://issues.apache.org/jira/browse/HBASE-15436
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.0.2
>            Reporter: Sangjin Lee
>         Attachments: hbaseException.log, threaddump.log
>
>
> We noticed an instance where the thread that was executing a flush 
> ({{BufferedMutatorImpl.flush()}}) got stuck when the (local one-node) cluster 
> shut down and was unable to get out of that stuck state.
> The setup is a single node HBase cluster, and apparently the cluster went 
> away when the client was executing flush. The flush eventually logged a 
> failure after 30+ minutes of retrying. That is understandable.
> What is unexpected is that the thread is stuck in this state (i.e. in the 
> {{flush()}} call). I would have expected the {{flush()}} call to return after 
> the complete failure.


