The IOE below has a matching complaint in the datanode. Here are the matching client and datanode exceptions:

CLIENT:

2007-11-28 21:38:39,391 DEBUG hbase.HStore - started compaction of 3 files in /hbase/compaction.dir/hregion_-1314266377/cookie
2007-11-28 21:38:39,398 DEBUG fs.DFSClient - Failed to connect to /XX.XX.XX.17:50010:java.io.IOException: Got error in response to OP_READ_BLOCK
        at org.apache.hadoop.dfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:753)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:979)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1075)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1027)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at java.io.DataInputStream.readByte(DataInputStream.java:248)
        at org.apache.hadoop.hbase.HStoreFile.loadInfo(HStoreFile.java:590)
        at org.apache.hadoop.hbase.HStore.compact(HStore.java:1007)
        at org.apache.hadoop.hbase.HRegion.compactStores(HRegion.java:745)
        at org.apache.hadoop.hbase.HRegion.compactIfNeeded(HRegion.java:704)
        at org.apache.hadoop.hbase.HRegionServer$Compactor.run(HRegionServer.java:378)



DATANODE:

2007-11-28 21:38:39,397 WARN  dfs.DataNode - XX.XX.XX.17:50010:Got exception while serving blk_5964910456479636527 to /XX.XX.XX.17:
java.io.IOException: Block blk_5964910456479636527 is not valid.
        at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:549)
        at org.apache.hadoop.dfs.FSDataset.getMetaFile(FSDataset.java:466)
        at org.apache.hadoop.dfs.FSDataset.metaFileExists(FSDataset.java:470)
        at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1281)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:897)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
        at java.lang.Thread.run(Thread.java:619)

2007-11-28 21:38:39,397 ERROR dfs.DataNode - XX.XX.XX.17:50010:DataXceiver: 
java.io.IOException: Block blk_5964910456479636527 is not valid.
        at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:549)
        at org.apache.hadoop.dfs.FSDataset.getMetaFile(FSDataset.java:466)
        at org.apache.hadoop.dfs.FSDataset.metaFileExists(FSDataset.java:470)
        at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1281)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:897)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
        at java.lang.Thread.run(Thread.java:619)


How should I recover from this (and from the AlreadyBeingCreatedException below, which has no matching namenode complaint)?

Thanks,
St.Ack



stack wrote:
Reading down through the dfs code, it looks like at least the AlreadyBeingCreatedException should be getting retried. Why, then, am I seeing it in my code (and what should I do about the IOE below, which has no matching complaint in the namenode)?

Thanks,
St.Ack

stack wrote:
Here's more:

2007-11-27 05:00:26,331 DEBUG fs.DFSClient - Failed to connect to /XX.XX.XX.30:50010:java.io.IOException: Got error in response to OP_READ_BLOCK
    at org.apache.hadoop.dfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:753)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:979)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1075)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
    at org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<init>(HStoreFile.java:820)
    at org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:932)
    at org.apache.hadoop.hbase.HStore.compactHStoreFiles(HStore.java:1084)
    at org.apache.hadoop.hbase.HStore.compact(HStore.java:1017)
    at org.apache.hadoop.hbase.HRegion.compactStores(HRegion.java:745)
    at org.apache.hadoop.hbase.HRegion.compactIfNeeded(HRegion.java:704)
    at org.apache.hadoop.hbase.HRegionServer$Compactor.run(HRegionServer.java:378)


Am I supposed to retry?
Will that even make a difference? Every ten minutes or so the client retries getting a reader on either the same file or another one, and it gets the same error.
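If an explicit retry is what's intended, I imagine it looks something like the sketch below (MAX_ATTEMPTS and PAUSE_MS are names I made up, and fs/dirName/conf are whatever the caller already has in hand; opening the MapFile.Reader is the call that blows up in the trace above):

    // Sketch only: bounded retry with a pause around opening the reader.
    MapFile.Reader openWithRetries(final FileSystem fs, final String dirName,
        final Configuration conf) throws IOException {
      final int MAX_ATTEMPTS = 3;        // made-up bound
      final long PAUSE_MS = 10 * 1000;   // made-up pause
      IOException lastFailure = null;
      for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        try {
          return new MapFile.Reader(fs, dirName, conf);
        } catch (IOException e) {
          lastFailure = e;               // e.g. "Got error in response to OP_READ_BLOCK"
          try {
            Thread.sleep(PAUSE_MS);      // back off before trying again
          } catch (InterruptedException ie) {
            throw new IOException("Interrupted retrying open of " + dirName);
          }
        }
      }
      throw lastFailure;                 // retries exhausted
    }

Though given that the periodic retry already hits the same error, I am not sure a tighter loop like this buys anything.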

Thanks for any insight,
St.Ack


stack wrote:
Anyone have insight on the following message from a near-TRUNK namenode log?

2007-11-26 01:16:23,282 WARN dfs.StateChange - DIR* NameSystem.startFile: failed to create file /hbase/hregion_-1194436719/oldlogfile.log for DFSClient_610028837 on client 38.99.77.80 because current leaseholder is trying to recreate file.

It starts for no apparent reason. There is no exception or warning preceding it in the log, and the pertinent code had been creating files in dfs without trouble for a good while beforehand. After it starts, it repeats every minute until the client side goes away. While the client is up, it retries creating the file every 5 minutes or so.

From the client-side, the exception looks like this:

2007-11-26 01:21:23,999 WARN hbase.HMaster - Processing pending operations: ProcessServerShutdown of XX.XX.XX.XX:60020
org.apache.hadoop.dfs.AlreadyBeingCreatedException: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /hbase/hregion_-1194436719/oldlogfile.log for DFSClient_610028837 on client XX.XX.XX.XX because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:848)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:804)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:276)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
    at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1094)

The code is not exotic:

SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, logfile, HLogKey.class, HLogEdit.class);

(See HLog in hbase around line #180 if you want more context.)

I see the comments in the incomplete HADOOP-2050. There is some implication that an attempt at a create should be wrapped in a try/catch that retries on any AlreadyBeingCreatedException. Should I be doing this (it looks like it's done in DFSClient)?
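If so, I imagine it would look something like this sketch (maxRetries and retryWaitMs are names I made up, and it assumes the RemoteException coming back over RPC has already been decoded back into an AlreadyBeingCreatedException, as RemoteExceptionHandler.decodeRemoteException appears to do in the trace above):

    // Sketch only: keep retrying the create while the namenode says the
    // previous lease holder is still trying to recreate the file.
    SequenceFile.Writer w = null;
    for (int attempt = 0; w == null; attempt++) {
      try {
        w = SequenceFile.createWriter(fs, conf, logfile, HLogKey.class,
          HLogEdit.class);
      } catch (AlreadyBeingCreatedException e) {
        if (attempt >= maxRetries) {
          throw e;                     // give up; the old lease never freed up
        }
        try {
          Thread.sleep(retryWaitMs);   // wait out the previous holder's lease
        } catch (InterruptedException ie) {
          throw new IOException("Interrupted retrying create of " + logfile);
        }
      }
    }

I would rather not hand-roll this in hbase if dfs is going to do the retrying for me, hence the question.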

Thanks for any insight,
St.Ack

P.S. HADOOP-2283 is the issue I've opened to cover this particular topic.





