Hi all,
We also encountered this issue on our POC cluster; it looks like the same issue you were talking about. We haven't found the root cause yet either, but we'd like to share some findings here that may help others investigate.
Environment:
We are using cdh-4.2.1 (hadoop-2.0.0, hbase-0.94.2) as our base distro, with some bugfixes ported from other Apache releases, running on 14 machines with CentOS-5.3.
The error messages we found for this issue:
The HBase master reports a "Parent directory doesn't exist" error:
{code:title=hbase-master.log}
[2014-02-18 01:20:30,550][master-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8100,1392703701863.archivedHFileCleaner][WARN][org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache]: Snapshot directory: hdfs://cluster1/user/SPN-hbase/.snapshot doesn't exist
[2014-02-18 01:20:36,597][IPC Server handler 47 on 8100][ERROR][org.apache.hadoop.hbase.master.HMaster]: Region server ^@^@aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 reported a fatal error:
ABORTING region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776: IOE in log roller
Cause:
java.io.IOException: Exception in createWriter
    at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:694)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:631)
    at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: cannot get log writer
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:750)
    at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
    ... 4 more
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:747)
    ... 5 more
{code}
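A side note on why the region server aborts here instead of just recreating the missing directory: as far as I can tell, the 0.94 WAL writer uses a non-recursive create for new log files on purpose, so that a region server whose .logs/<regionserver> directory has already been removed fails fast instead of silently writing into a recreated directory. A minimal sketch of that HDFS-level behaviour (class name and paths below are made up for illustration, not from our setup):
{code:title=WalParentDirDemo.java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalParentDirDemo {
  public static void main(String[] args) throws Exception {
    // Run with the cluster's core-site.xml/hdfs-site.xml on the classpath so this
    // resolves to HDFS; the generic FileSystem base class does not support
    // createNonRecursive().
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path walDir = new Path("/tmp/demo/.logs/rs1");   // stand-in for the region server's WAL dir
    fs.mkdirs(walDir);
    fs.delete(walDir.getParent(), true);             // simulate the external delete of ".logs"

    // fs.create() would quietly recreate the missing parents. createNonRecursive()
    // makes the namenode verify the parent first, which fails with
    // FileNotFoundException: "Parent directory doesn't exist: ...".
    fs.createNonRecursive(new Path(walDir, "demo-wal.0000000001"),
        true, 4096, (short) 3, 64 * 1024 * 1024, null);
  }
}
{code}
With a plain fs.create() the missing parents would be recreated quietly; with createNonRecursive() the namenode's verifyParentDir() check fails, which matches the stack trace above. So the abort itself looks expected once the directory is gone; the real question is why the directory disappeared.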
The namenode created "/<hbase-root>/.logs/<regionserver>" (mkdirs) as requested by principal "hbase/<fqdn>@LAB", but later removed "/<hbase-root>/.logs" as requested by "hdfs/<fqdn>@LAB"!
{code:title=namenode.log (with hdfs audit enabled)}
2014-02-18 01:08:38,203 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS) ip=/10.147.1.168 cmd=mkdirs src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 dst=null perm=hbase:supergroup:rwxr-xr-x
...
2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
...
2014-02-18 01:19:10,132 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
2014-02-18 01:19:10,132 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.147.1.168:35703: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
{code}
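To convince ourselves that this sequence alone (a superuser deleting .logs while a WAL file is still open for write) is enough to produce both namenode errors above, here is a toy, HBase-independent sketch against a MiniDFSCluster (from the hadoop-hdfs test jar). The class name, paths and block-size tweak are hypothetical, not from our cluster, and I'm not claiming this is the root cause, only that the symptoms line up:
{code:title=DeletedLogsDirDemo.java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class DeletedLogsDirDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.blocksize", 1024 * 1024);   // tiny blocks so the writer asks the namenode for new blocks quickly
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path logsDir = new Path("/hbase/.logs/rs1");          // stand-in for /user/SPN-hbase/.logs/<regionserver>
      Path openWal = new Path(logsDir, "rs1.0000000001");

      // "Region server" side: create the WAL dir and keep a file open for write.
      fs.mkdirs(logsDir);
      FSDataOutputStream out = fs.create(openWal);
      out.write(new byte[512 * 1024]);
      out.hflush();

      // "hdfs superuser" side: recursive delete of .logs, like the audit line
      // "cmd=delete src=/user/SPN-hbase/.logs" above.
      fs.delete(new Path("/hbase/.logs"), true);

      // Symptom 1: the still-open writer fails the next time it needs the namenode
      // (addBlock for the next block, or completeFile on close) -> LeaseExpiredException.
      try {
        out.write(new byte[2 * 1024 * 1024]);
        out.hflush();
        out.close();
      } catch (Exception e) {
        System.out.println("open WAL writer failed: " + e);
      }

      // Symptom 2: rolling to a new WAL under the vanished parent fails, because the
      // writer refuses to recreate the parent -> "Parent directory doesn't exist".
      try {
        fs.createNonRecursive(new Path(logsDir, "rs1.0000000002"),
            true, 4096, (short) 1, 1024 * 1024, null);
      } catch (Exception e) {
        System.out.println("log roll failed: " + e);
      }
    } finally {
      cluster.shutdown();
    }
  }
}
{code}
If that assumption holds, this should show the same pair of symptoms as the excerpts above: the still-open writer hits LeaseExpiredException on its next namenode call, and the next roll under the vanished parent fails with "Parent directory doesn't exist".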
Way to reproduce:
Currently we have a way to reproduce this issue: run the YCSB workloads [a-f] against HBase while, on the other side, repeatedly executing "service <hadoop,hbase-daemons> start". We run both sides 10 times in a loop as one round, and the issue can usually be reproduced within the first or second round. But I am not sure whether it is the same issue that was discussed at the beginning of this thread.
Follow-ups:
Now I am following up on the weird message that appeared in namenode.log
{code}
2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB(auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
{code}
to see what more I can dig out.
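For that digging, I'm using something like the following to scan the namenode audit log for delete/rename operations under the HBase root and see which principal issued them. The audit log file name, the HBase root path and the line format here are assumptions based on the excerpt above, so adjust them for your own layout:
{code:title=AuditDeleteScanner.java}
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditDeleteScanner {
  // Matches fields in namenode audit lines such as:
  //   ... allowed=true ugi=hdfs/host@LAB(auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs ...
  private static final Pattern AUDIT =
      Pattern.compile("ugi=(\\S+).*?ip=(\\S+)\\s+cmd=(\\S+)\\s+src=(\\S+)");

  public static void main(String[] args) throws Exception {
    String auditLog = args.length > 0 ? args[0] : "hdfs-audit.log";   // assumed audit log file name
    String hbaseRoot = "/user/SPN-hbase";                             // our HBase root

    BufferedReader reader = new BufferedReader(new FileReader(auditLog));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher m = AUDIT.matcher(line);
        if (!m.find()) {
          continue;
        }
        String ugi = m.group(1);
        String ip = m.group(2);
        String cmd = m.group(3);
        String src = m.group(4);
        // Only interested in operations that can make the WAL directory disappear.
        if (("delete".equals(cmd) || "rename".equals(cmd)) && src.startsWith(hbaseRoot)) {
          System.out.println(cmd + " " + src + " by " + ugi + " from " + ip
              + "  [" + line.substring(0, Math.min(23, line.length())) + "]");
        }
      }
    } finally {
      reader.close();
    }
  }
}
{code}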
I'll keep updating this thread with any findings; any opinions are welcome. Thanks.
Best regards
takeshi