[ 
https://issues.apache.org/jira/browse/ACCUMULO-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Newton resolved ACCUMULO-1364.
-----------------------------------

    Resolution: Fixed

Reopen if you find problems.
                
> Silent failure after power outage
> ---------------------------------
>
>                 Key: ACCUMULO-1364
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1364
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: master, tserver
>         Environment: hadoop-1.0.4, accumulo-1.5-SNAPSHOT svn version 1470047
>            Reporter: John Vines
>            Assignee: Eric Newton
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> We were doing some testing on an Accumulo snapshot using continuous ingest 
> when the power went out. When it came back we noticed some corrupt blocks in 
> HDFS, mostly around the WAL. I wasn't certain if that was a happenstance of 
> how the sync blocks can turn out, so I went ahead and started Accumulo to see 
> if it could handle it. What I got wasn't what I expected.
> There are 0 errors reported on the monitor. It just sits with 5 tservers 
> available and no tablets online. The master appears to have attempted an 
> assignment and is now waiting for the walog to close, which never happens:
> {quote} 2013-04-30 10:38:23,648 [master.EventCoordinator] INFO : There are 
> now 5 tablet servers
> 2013-04-30 10:38:23,719 [state.ZooTabletStateStore] DEBUG: root tablet logSet 
> [172.16.102.202+9997/fa545e93-5eba-46b4-9266-dbd60cb56943]
> 2013-04-30 10:38:23,720 [state.ZooTabletStateStore] DEBUG: root tablet logSet 
> [172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462]
> 2013-04-30 10:38:23,725 [state.ZooTabletStateStore] DEBUG: Returning root 
> tablet state: 
> !0;!0<<@(null,172.16.102.202:9997[33e57eff04c0001],172.16.102.202:9997[33e57eff04c0001])
> 2013-04-30 10:38:23,740 [master.Master] INFO : Loaded class : 
> org.apache.accumulo.server.master.recovery.HadoopLogCloser
> 2013-04-30 10:38:23,741 [recovery.RecoveryManager] INFO : Starting recovery 
> of ed30bd24-b348-4344-8614-a2d79f933462 (in : 10s) created for 
> 172.16.102.202+9997, tablet !0;!0<< holds a reference
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet]: scan time 0.04 
> seconds
> 2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet] sleeping for 
> 60.00 seconds
> 2013-04-30 10:38:23,823 [metrics.MetricsConfiguration] DEBUG: Loading config 
> file: 
> /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:23,838 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.21 seconds
> 2013-04-30 10:38:23,841 [master.Master] DEBUG: not balancing because there 
> are unhosted tablets
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: Finished gathering information 
> from 5 servers in 0.01 seconds
> 2013-04-30 10:38:23,852 [master.Master] DEBUG: not balancing because there 
> are unhosted tablets
> 2013-04-30 10:38:23,861 [metrics.MetricsConfiguration] DEBUG: Metrics 
> collection enabled=false
> 2013-04-30 10:38:23,874 [impl.ThriftScanner] DEBUG: Error getting transport 
> to 172.16.102.202:9997 : NotServingTabletException(extent:TKeyExtent(table:21 
> 30, endRow:21 30 3C, prevEndRow:null))
>  {quote}
> That Exception repeats endlessly with periodic
> bq. 2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for 
> file to be closed 
> /accumulo/wal/172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462
> On the tserver in question, it seems to have no idea that it's supposed to be 
> recovering the root tablet though
> {quote}
> 2013-04-30 10:38:22,432 [tabletserver.TabletServer] DEBUG: 
> org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler 
> created
> 2013-04-30 10:38:22,544 [metrics.MetricsConfiguration] DEBUG: Loading config 
> file: 
> /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
> 2013-04-30 10:38:22,549 [metrics.MetricsConfiguration] DEBUG: Metrics 
> collection enabled=false
> 2013-04-30 10:38:22,551 [tabletserver.TabletServer] INFO : port = 9997
> 2013-04-30 10:38:22,621 [tabletserver.TabletServer] DEBUG: Obtained tablet 
> server lock 
> /accumulo/242078a7-dd19-4d08-8952-f5109f6f7962/tservers/172.16.102.202:9997/zlock-0000000000
> 2013-04-30 10:38:23,266 [tabletserver.TabletServer] DEBUG: gc 
> ParNew=0.00(+0.00) secs ConcurrentMarkSweep=0.00(+0.00) secs 
> freemem=8,486,794,504(+45,036,880) totalmem=8,536,260,608
> 2013-04-30 10:38:23,947 [tabletserver.TabletServer] DEBUG: MultiScanSess 
> 172.16.102.200:50034 0 entries in 0.07 secs (lookup_time:0.00 secs tablets:1
>  ranges:1) 
> 2013-04-30 10:38:23,986 [tabletserver.TabletServer] DEBUG: MultiScanSess 
> 172.16.102.200:50034 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1
>  ranges:1) 
> {quote}
> That debug message repeats endlessly. The out and err files on the master 
> and that tserver are empty.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
