[
https://issues.apache.org/jira/browse/ACCUMULO-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Drob resolved ACCUMULO-2339.
---------------------------------
Resolution: Cannot Reproduce
Closing as Cannot Reproduce per Eric's comments. If somebody sees this again,
please file a follow-on JIRA.
> WAL recovery fails
> ------------------
>
> Key: ACCUMULO-2339
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2339
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.5.0
> Environment: testing 1.5.1rc1 on a 10 node cluster, hadoop 2.2.0, zk
> 3.4.5
> Reporter: Eric Newton
> Priority: Critical
>
> I was running accumulo 1.5.1rc1 on a 10 node cluster. After two days, I saw
> that several tservers had died with OOME. Several hundred tablets were
> offline.
> The master was attempting to recover the write lease on the WAL file, and
> this was failing.
> Attempts to examine the log file failed:
> {noformat}
> $ hadoop fs -cat
> /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
> Cannot obtain block length for
> LocatedBlock{BP-901421341-192.168.1.3-1389719663617:blk_1076582460_2869891;
> getBlockSize()=0; corrupt=false; offset=0; locs=[192.168.1.5:50010]}
> {noformat}
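> A stuck last block like this can sometimes be resolved by forcing lease
> recovery on the file. As a sketch only: the {{hdfs debug recoverLease}}
> subcommand was added in a later Hadoop release and may not exist in 2.2.0,
> and the path below is the WAL from this report:
> {noformat}
> # Ask the NameNode to recover the lease on the WAL file so its
> # last block can be finalized and a length resolved.
> hdfs debug recoverLease \
>   -path /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14 \
>   -retries 3
> {noformat}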
> Looking at the DN logs, I see this:
> {noformat}
> 2014-02-06 12:48:35,798 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> NameNode at host2/192.168.1.3:9000 calls
> recoverBlock(BP-901421341-192.168.1.3-1389719663617:blk_1076582290_2869721,
> targets=[192.168.1.5:50010], newGenerationStamp=2880680)
> 2014-02-06 12:48:35,798 INFO
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
> initReplicaRecovery: blk_1076582290_2869721, recoveryId=2880680,
> replica=ReplicaBeingWritten, blk_1076582290_2869721, RBW
> getNumBytes() = 634417185
> getBytesOnDisk() = 634417113
> getVisibleLength()= 634417113
> getVolume() = /srv/hdfs4/hadoop/dn/current
> getBlockFile() =
> /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290
> bytesAcked=634417113
> bytesOnDisk=634417113
> {noformat}
> I'm guessing that the /srv/hdfs4 partition filled up, and that the
> disagreement between the replica's expected length (getNumBytes() =
> 634417185) and the bytes actually on disk (634417113) is causing the
> recovery failures.
> Restarting HDFS made no difference.
> I manually copied the block up into HDFS as the WAL to make any progress.
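> That manual workaround might look like the following sketch. The block and
> WAL paths are taken from the logs above; on a live cluster the exact steps
> (and whether the RBW file is intact enough to use) would need checking
> first:
> {noformat}
> # Remove the unrecoverable WAL entry, then copy the raw replica file
> # from the datanode's local disk back into HDFS under the WAL path.
> hadoop fs -rm /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
> hadoop fs -put \
>   /srv/hdfs4/hadoop/dn/current/BP-901421341-192.168.1.3-1389719663617/current/rbw/blk_1076582290 \
>   /accumulo/wal/192.168.1.5+9997/bc94602a-9a57-45f6-afdf-ffa2a5b70b14
> {noformat}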
--
This message was sent by Atlassian JIRA
(v6.2#6252)