Josh Elser created ACCUMULO-2053:
------------------------------------

             Summary: Slow reassignment after failure and recovery
                 Key: ACCUMULO-2053
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2053
             Project: Accumulo
          Issue Type: Improvement
          Components: master
         Environment: 5bb28edb with Hadoop 2.2.0
            Reporter: Josh Elser


Running CI, I noticed the following situation. Agitation killed a tabletserver. 
Recovery was performed, but the monitor reported that the tablets had not yet 
been reassigned. After a minute, a significant number of tablets (~15 out of 
150) were still offline for a single table. The tablets then went from 
unassigned to assigned one at a time.

Tailing the master log confirmed this. I saw the following lines repeated for 
every offline tablet:

{noformat}
2013-12-17 21:10:52,615 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://nameservice/accumulo/wal/tserver1+9997/0a60966c-b72d-4643-bf39-3fbfec342cc0
 to hdfs://namenode/accumulo/recovery/0a60966c-b72d-4643-bf39-3fbfec342cc0
2013-12-17 21:10:52,624 [recovery.RecoveryManager] DEBUG: Recovering 
hdfs://nameservice/accumulo/wal/tserver1+9997/327e38cb-9f96-41a4-baff-a97d89d523e9
 to hdfs://nameservice/accumulo/recovery/327e38cb-9f96-41a4-baff-a97d89d523e9
{noformat}

It seems like we should be able to bring all of these tablets back online at 
once (or at least faster than one every 10 seconds, as the log showed) because 
the recovery file had already been created. This made the complete recovery 
process take longer than it should have, as we waited 150s before reassigning 
the last tablet.
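
To make the arithmetic above concrete, here is a rough sketch of the delay 
under the observed behavior versus assigning every recovered tablet in one 
pass. This is only a toy model, not the master's actual assignment code: the 
function name and its parameters are made up for illustration, and it assumes 
the first assignment happens one full interval after recovery completes.

```python
import math


def time_to_last_assignment(offline_tablets: int,
                            tablets_per_pass: int,
                            pass_interval_s: float) -> float:
    """Seconds until the last offline tablet is assigned.

    Hypothetical model: each pass assigns up to `tablets_per_pass`
    tablets, and passes are `pass_interval_s` seconds apart.
    """
    if offline_tablets <= 0:
        return 0.0
    if tablets_per_pass <= 0:
        raise ValueError("tablets_per_pass must be positive")
    passes = math.ceil(offline_tablets / tablets_per_pass)
    return passes * pass_interval_s


# Observed: ~15 offline tablets, one assigned roughly every 10 seconds.
observed = time_to_last_assignment(15, tablets_per_pass=1, pass_interval_s=10.0)

# Proposed: assign all tablets sharing the recovered log in a single pass.
batched = time_to_last_assignment(15, tablets_per_pass=15, pass_interval_s=10.0)

print(f"one per pass: {observed:.0f}s, all in one pass: {batched:.0f}s")
```

Under these assumptions the one-at-a-time case reproduces the 150s wait, while 
a single batched pass would finish within one interval.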



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
