[
https://issues.apache.org/jira/browse/ACCUMULO-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Newton resolved ACCUMULO-1364.
-----------------------------------
Resolution: Fixed
Reopen if you find problems.
> Silent failure after power outage
> ---------------------------------
>
> Key: ACCUMULO-1364
> URL: https://issues.apache.org/jira/browse/ACCUMULO-1364
> Project: Accumulo
> Issue Type: Sub-task
> Components: master, tserver
> Environment: hadoop-1.0.4, accumulo-1.5-SNAPSHOT svn version 1470047
> Reporter: John Vines
> Assignee: Eric Newton
> Priority: Blocker
> Fix For: 1.5.0
>
>
> We were doing some testing on an Accumulo snapshot using continuous ingest
> when the power went out. When it came back we noticed some corrupt blocks in
> HDFS, mostly around the WAL. I wasn't certain if that was a happenstance of
> how the sync blocks can turn out, so I went ahead and started Accumulo to see
> if it could handle it. What I got wasn't what I expected.
> There are 0 errors reported on the monitor. It just sits with 5 tservers
> available and no tablets online. The master appears to have attempted an
> assignment and is now waiting for the walog to close, which never happens:
> {quote} 2013-04-30 10:38:23,648 [master.EventCoordinator] INFO : There are
> now 5 tablet servers
> 2013-04-30 10:38:23,719 [state.ZooTabletStateStore] DEBUG: root tablet logSet
> [172.16.102.202+9997/fa545e93-5eba-46b4-9266-dbd60cb56943]
> 2013-04-30 10:38:23,720 [state.ZooTabletStateStore] DEBUG: root tablet logSet
> [172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462]
> 2013-04-30 10:38:23,725 [state.ZooTabletStateStore] DEBUG: Returning root
> tablet state:
> !0;!0<<@(null,172.16.102.202:9997[33e57eff04c0001],172.16.102.202:9997[33e57eff04c0001])
> 2013-04-30 10:38:23,740 [master.Master] INFO : Loaded class :
> org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 2013-04-30 10:38:23,741 [recovery.RecoveryManager] INFO : Starting recovery
> of ed30bd24-b348-4344-8614-a2d79f933462 (in : 10s) created for
> 172.16.102.202+9997, tablet !0;!0<< holds a reference
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet]: scan time 0.04
> seconds
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet] sleeping for
> 60.00 seconds
> 2013-04-30 10:38:23,823 [metrics.MetricsConfiguration] DEBUG: Loading config
> file:
> /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:23,838 [master.Master] DEBUG: Finished gathering information
> from 5 servers in 0.21 seconds
> 2013-04-30 10:38:23,841 [master.Master] DEBUG: not balancing because there
> are unhosted tablets
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: Finished gathering information
> from 5 servers in 0.01 seconds
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: not balancing because there
> are unhosted tablets
> 2013-04-30 10:38:23,861 [metrics.MetricsConfiguration] DEBUG: Metrics
> collection enabled=false
> 2013-04-30 10:38:23,874 [impl.ThriftScanner] DEBUG: Error getting transport
> to 172.16.102.202:9997 : NotServingTabletException(extent:TKeyExtent(table:21
> 30, endRow:21 30 3C, prevEndRow:null))
> {quote}
> That exception repeats endlessly, interspersed with a periodic
> bq. 2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for
> file to be closed
> /accumulo/wal/172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462
> The tserver in question, though, seems to have no idea that it's supposed to
> be recovering the root tablet:
> {quote}
> 2013-04-30 10:38:22,432 [tabletserver.TabletServer] DEBUG:
> org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler
> created
> 2013-04-30 10:38:22,544 [metrics.MetricsConfiguration] DEBUG: Loading config
> file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:22,549 [metrics.MetricsConfiguration] DEBUG: Metrics
> collection enabled=false
> 2013-04-30 10:38:22,551 [tabletserver.TabletServer] INFO : port = 9997
> 2013-04-30 10:38:22,621 [tabletserver.TabletServer] DEBUG: Obtained tablet
> server lock /accumulo/242078a7-dd19-4d08-8952-f5109f6f7962/tservers/172.16.102.202:9997/zlock-0000000000
> 2013-04-30 10:38:23,266 [tabletserver.TabletServer] DEBUG: gc
> ParNew=0.00(+0.00) secs ConcurrentMarkSweep=0.00(+0.00) secs
> freemem=8,486,794,504(+45,036,880) totalmem=8,536,260,608
> 2013-04-30 10:38:23,947 [tabletserver.TabletServer] DEBUG: MultiScanSess
> 172.16.102.200:50034 0 entries in 0.07 secs (lookup_time:0.00 secs tablets:1
> ranges:1)
> 2013-04-30 10:38:23,986 [tabletserver.TabletServer] DEBUG: MultiScanSess
> 172.16.102.200:50034 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
> ranges:1)
> {quote}
> That debug message repeats endlessly. The out and err files on the master
> and that tserver are empty.