Hello Accumulo wizards,

I have a large schema of test data in an Accumulo instance that is currently inaccessible, and which I would like to recover if possible. I'll explain the problem in the hope that folks who know the intricacies of the Accumulo root table, WALs, and the recovery process can tell me whether there are additional actions worth taking or whether I should treat this schema as hosed.
The problem is similar to what was reported here: https://community.hortonworks.com/questions/52718/failed-to-locate-tablet-for-table-0-row-err.html. That is, no tablets are loaded except one from accumulo.root, and the logs are repeating these messages rapidly:

==> monitor_stti-master.bbn.com.debug.log <==
2017-04-21 07:10:55,047 [impl.ThriftScanner] DEBUG: Failed to locate tablet for table : !0 row : ~err_

==> master_stti-master.bbn.com.debug.log <==
2017-04-21 07:10:55,430 [master.Master] DEBUG: Finished gathering information from 13 servers in 0.03 seconds
2017-04-21 07:10:55,430 [master.Master] DEBUG: not balancing because there are unhosted tablets: 2

The RecoveryManager insists that it is trying to recover five WALs:

2017-04-21 07:28:48,349 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/0d28801e-322e-44e6-97e3-a34a14b4bd1a
2017-04-21 07:28:48,358 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/696d4353-0041-4397-a1f5-b8600b5cb2e9
2017-04-21 07:28:48,362 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/e62f4195-c7d6-419a-a696-ff89b10cecc3
2017-04-21 07:28:48,366 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/01a0887e-4ac8-4772-8f5f-b99371e1df0a
2017-04-21 07:28:48,369 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/6f392ec5-821b-4fd5-83e4-baf1f47d8105

Based on the advice in the post linked above, I grepped the logs and confirmed that all five of those WALs were in fact deleted by the garbage collector (here is the output from my grep; note the earlier timestamps):

gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,275 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105] from stti-data-102.bbn.com+10011
gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,280 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3] from stti-data-103.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-03 20:25:26,699 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a] from stti-data-103.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:32:11,106 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a] from stti-data-102.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:37:14,875 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9] from stti-data-103.bbn.com+10011
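(For reference, the check amounted to grepping each recovering WAL's UUID against the GC debug logs, something along the lines of the loop below, run from the Accumulo log directory, with the UUIDs taken from the RecoveryManager messages above.)

# Look for GC "deleted" messages mentioning any of the five WALs
# that the RecoveryManager is still trying to recover.
for uuid in 0d28801e-322e-44e6-97e3-a34a14b4bd1a \
            696d4353-0041-4397-a1f5-b8600b5cb2e9 \
            e62f4195-c7d6-419a-a696-ff89b10cecc3 \
            01a0887e-4ac8-4772-8f5f-b99371e1df0a \
            6f392ec5-821b-4fd5-83e4-baf1f47d8105; do
  grep -H "$uuid" gc_stti-master.bbn.com.debug.log* | grep deleted
done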
All five of those WALs still appear as references in the accumulo.root table:

!0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a [] hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a|1
!0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9 [] hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9|1
!0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3 [] hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3|1
...
!0< log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a [] hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a|1
!0< log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 [] hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105|1

I also see three outstanding fate transactions (at least two of which appear to me to be related to the accumulo.root table):

root@bbn-beta> fate print
txid: 6b33fa130909f05d  status: IN_PROGRESS  op: CompactRange  locked: [R:+accumulo, R:!0]  locking: []  top: CompactionDriver
txid: 564d758d584af61e  status: IN_PROGRESS  op: CompactRange  locked: [R:+accumulo, R:!0]  locking: []  top: CompactionDriver
txid: 4a620317a53a4a93  status: IN_PROGRESS  op: CreateTable   locked: [W:5e, R:+default]   locking: []  top: PopulateMetadata

I checked in ZooKeeper, and the /accumulo/$INSTANCE/root_tablet/walogs and /accumulo/$INSTANCE/recovery/[locks] directories are all empty.

I don't know exactly what to do at this point. I could:

a) Try deleting the fate operations and see if that releases the Accumulo instance.
b) Try deleting the accumulo.root table entries pointing to the already-deleted WALs.
c) Call it quits on this instance, blow it away, and start re-generating my test data over the weekend.

Before resorting to option (c), I would most likely try options (a) and (b) first, probably in that order (rough sketches of what I have in mind for both are below my signature). But I would love to get some insight from the Accumulo experts first.

Thanks in advance,
Jonathan
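P.S. To make options (a) and (b) concrete, here is roughly what I was picturing. I have not run any of this yet, so please correct me if these are the wrong tools or the wrong order.

For (a), failing and then deleting the stuck fate transactions from the Accumulo shell (my understanding is that the master should be stopped before touching fate transactions, but please correct me if that's wrong):

root@bbn-beta> fate fail 6b33fa130909f05d
root@bbn-beta> fate delete 6b33fa130909f05d
(and likewise for 564d758d584af61e and 4a620317a53a4a93)

For (b), deleting the stale log: entries from accumulo.root in the shell, copying the row/family/qualifier values from the scan output above (I assume root first needs write permission on accumulo.root, and that the long qualifiers may need quoting):

root@bbn-beta> grant Table.WRITE -t accumulo.root -u root
root@bbn-beta> table accumulo.root
root@bbn-beta accumulo.root> delete !0;~ log stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a
(and likewise for the other four log: entries)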
