The IOE below has a matching complaint in the datanode. Here
are the matching client and datanode exceptions:
CLIENT:
2007-11-28 21:38:39,391 DEBUG hbase.HStore - started compaction of 3 files in /hbase/compaction.dir/hregion_-1314266377/cookie
2007-11-28 21:38:39,398 DEBUG fs.DFSClient - Failed to connect to /XX.XX.XX.17:50010:java.io.IOException: Got error in response to OP_READ_BLOCK
    at org.apache.hadoop.dfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:753)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:979)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1075)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1027)
    at java.io.FilterInputStream.read(FilterInputStream.java:66)
    at java.io.DataInputStream.readByte(DataInputStream.java:248)
    at org.apache.hadoop.hbase.HStoreFile.loadInfo(HStoreFile.java:590)
    at org.apache.hadoop.hbase.HStore.compact(HStore.java:1007)
    at org.apache.hadoop.hbase.HRegion.compactStores(HRegion.java:745)
    at org.apache.hadoop.hbase.HRegion.compactIfNeeded(HRegion.java:704)
    at org.apache.hadoop.hbase.HRegionServer$Compactor.run(HRegionServer.java:378)
DATANODE:
2007-11-28 21:38:39,397 WARN dfs.DataNode - XX.XX.XX.17:50010:Got exception while serving blk_5964910456479636527 to /XX.XX.XX.17:
java.io.IOException: Block blk_5964910456479636527 is not valid.
    at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:549)
    at org.apache.hadoop.dfs.FSDataset.getMetaFile(FSDataset.java:466)
    at org.apache.hadoop.dfs.FSDataset.metaFileExists(FSDataset.java:470)
    at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1281)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:897)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
    at java.lang.Thread.run(Thread.java:619)
2007-11-28 21:38:39,397 ERROR dfs.DataNode - XX.XX.XX.17:50010:DataXceiver:
java.io.IOException: Block blk_5964910456479636527 is not valid.
    at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:549)
    at org.apache.hadoop.dfs.FSDataset.getMetaFile(FSDataset.java:466)
    at org.apache.hadoop.dfs.FSDataset.metaFileExists(FSDataset.java:470)
    at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1281)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:897)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
    at java.lang.Thread.run(Thread.java:619)
How should I recover from this (and from the
AlreadyBeingCreatedException below, which has no matching namenode
complaint)?
Thanks,
St.Ack
stack wrote:
Reading down through the dfs code, the AlreadyBeingCreatedException at
least should be getting retried. Why, then, am I seeing it in my code?
(And what should I do about the IOE below, which has no matching
complaint in the namenode?)
Thanks,
St.Ack
stack wrote:
Here's more:
2007-11-27 05:00:26,331 DEBUG fs.DFSClient - Failed to connect to /XX.XX.XX.30:50010:java.io.IOException: Got error in response to OP_READ_BLOCK
    at org.apache.hadoop.dfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:753)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:979)
    at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1075)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1383)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1360)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242)
    at org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.<init>(HStoreFile.java:820)
    at org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:932)
    at org.apache.hadoop.hbase.HStore.compactHStoreFiles(HStore.java:1084)
    at org.apache.hadoop.hbase.HStore.compact(HStore.java:1017)
    at org.apache.hadoop.hbase.HRegion.compactStores(HRegion.java:745)
    at org.apache.hadoop.hbase.HRegion.compactIfNeeded(HRegion.java:704)
    at org.apache.hadoop.hbase.HRegionServer$Compactor.run(HRegionServer.java:378)
Am I supposed to retry?
Will that even make a difference? Every ten minutes or so, the client
retries getting a reader on either the same file or another one and
gets the same error.
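If retrying on open is the answer, the kind of wrapper I have in mind
is the sketch below. It is not what HStore does today; the method name,
retry count, and sleep are placeholders, and the open is the same
MapFile.Reader constructor the compaction path bottoms out in:

    MapFile.Reader openReaderWithRetries(final FileSystem fs, final String dirName,
        final Configuration conf) throws IOException {
      final int maxRetries = 3;            // placeholder
      final long sleepMillis = 10 * 1000;  // placeholder
      for (int attempt = 0; ; attempt++) {
        try {
          return new MapFile.Reader(fs, dirName, conf);
        } catch (IOException e) {
          // e.g. "Got error in response to OP_READ_BLOCK"
          if (attempt >= maxRetries) {
            throw e;
          }
          try {
            Thread.sleep(sleepMillis);
          } catch (InterruptedException ie) {
            throw new IOException("Interrupted waiting to retry reader open");
          }
        }
      }
    }

Given the datanode is saying the block itself is not valid, though, I
suspect a wrapper like this only postpones the same failure.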
Thanks for any insight,
St.Ack
stack wrote:
Anyone have insight on the following message from a near-TRUNK
namenode log?
2007-11-26 01:16:23,282 WARN dfs.StateChange - DIR* NameSystem.startFile: failed to create file /hbase/hregion_-1194436719/oldlogfile.log for DFSClient_610028837 on client 38.99.77.80 because current leaseholder is trying to recreate file.
It starts for no apparent reason. There is no exception or warning
preceding it in the log, and creation of files in dfs by the pertinent
code had been working fine for a good while before this. After it
starts, it repeats every minute until the client side goes away.
While the client is up, it retries creating the file every 5 minutes
or so.
From the client side, the exception looks like this:
2007-11-26 01:21:23,999 WARN hbase.HMaster - Processing pending operations: ProcessServerShutdown of XX.XX.XX.XX:60020
org.apache.hadoop.dfs.AlreadyBeingCreatedException: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /hbase/hregion_-1194436719/oldlogfile.log for DFSClient_610028837 on client XX.XX.XX.XX because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:848)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:804)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:276)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
    at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1094)
The code is not exotic:
    SequenceFile.Writer w = SequenceFile.createWriter(fs, conf, logfile,
        HLogKey.class, HLogEdit.class);
(See HLog from hbase around #180 if you want to see more context).
I see the comments in the incomplete HADOOP-2050. There is some
implication that an attempt at a create should be wrapped in a
try/catch that retries any AlreadyBeingCreatedException. Should I
be doing this? (It looks like it's done in DFSClient.)
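If that is the intent, a minimal sketch of the wrapping I have in mind
is below. It assumes the RemoteException has already been decoded back
into AlreadyBeingCreatedException (the way HMaster does it via
RemoteExceptionHandler.decodeRemoteException); the method name, retry
count, and sleep are placeholders:

    SequenceFile.Writer createWriterWithRetries(final FileSystem fs,
        final Configuration conf, final Path logfile) throws IOException {
      final int maxRetries = 5;            // placeholder
      final long sleepMillis = 10 * 1000;  // placeholder
      for (int attempt = 0; ; attempt++) {
        try {
          return SequenceFile.createWriter(fs, conf, logfile,
              HLogKey.class, HLogEdit.class);
        } catch (AlreadyBeingCreatedException e) {
          // Old lease may not have expired yet; wait and try again,
          // rethrowing once we are out of attempts.
          if (attempt >= maxRetries) {
            throw e;
          }
          try {
            Thread.sleep(sleepMillis);
          } catch (InterruptedException ie) {
            throw new IOException("Interrupted waiting to retry create");
          }
        }
      }
    }

If DFSClient is already supposed to do this internally, I would rather
not duplicate it in HLog.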
Thanks for any insight,
St.Ack
P.S. HADOOP-2283 is the issue I've opened to cover this particular topic.