Hi Ted,
Yes, there is a region server running on 10.147.1.168; here is the log
snippet for it. Note that this region server actually did nothing around
2014-02-18 01:17:31, and it only detected the error around 01:18:53:
{code}
[2014-02-18 01:08:40,123][REPL IPC Server handler 0 on 8120][INFO ][org.apache.hadoop.ipc.HBaseServer]: REPL IPC Server handler 0 on 8120: starting
[2014-02-18 01:08:40,124][REPL IPC Server handler 1 on 8120][INFO ][org.apache.hadoop.ipc.HBaseServer]: REPL IPC Server handler 1 on 8120: starting
[2014-02-18 01:08:40,126][REPL IPC Server handler 2 on 8120][INFO ][org.apache.hadoop.ipc.HBaseServer]: REPL IPC Server handler 2 on 8120: starting
[2014-02-18 01:08:40,128][regionserver8120][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Serving as aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776, RPC listening on aws-scottm-tmh6-1.ap-southeast-1.compute.internal/10.147.1.168:8120, sessionid=0x14443974ce70009
[2014-02-18 01:08:40,130][regionserver8120][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Registered RegionServer MXBean
[2014-02-18 01:08:40,142][SplitLogWorker-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776][INFO ][org.apache.hadoop.hbase.regionserver.SplitLogWorker]: SplitLogWorker aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 starting
[2014-02-18 01:18:52,027][PRI IPC Server handler 0 on 8120][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Received request to open region: usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae.
[2014-02-18 01:18:52,551][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.regionserver.HRegion]: Setting up tabledescriptor config now ...
[2014-02-18 01:18:52,584][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.coprocessor.CoprocessorHost]: System coprocessor org.apache.hadoop.hbase.security.token.TokenProvider was loaded successfully with priority (536870911).
[2014-02-18 01:18:52,591][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.coprocessor.CoprocessorHost]: System coprocessor org.apache.hadoop.hbase.security.access.AccessController was loaded successfully with priority (536870912).
[2014-02-18 01:18:52,946][StoreOpenerThread-usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae.-1][INFO ][org.apache.hadoop.hbase.regionserver.Store]: time to purge deletes set to 0ms in store family
[2014-02-18 01:18:52,974][StoreFileOpenerThread-family-1][ERROR][org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics]: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false
[2014-02-18 01:18:52,983][StoreFileOpenerThread-family-1][INFO ][org.apache.hadoop.hbase.util.ChecksumType]: Checksum using org.apache.hadoop.util.PureJavaCrc32
[2014-02-18 01:18:52,983][StoreFileOpenerThread-family-1][INFO ][org.apache.hadoop.hbase.util.ChecksumType]: Checksum can use org.apache.hadoop.util.PureJavaCrc32C
[2014-02-18 01:18:53,305][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.regionserver.HRegion]: Onlined usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae.; next sequenceid=343
...
[2014-02-18 01:18:53,644][PostOpenDeployTasks:e8be76322f79998d58750d87f268e0ae][INFO ][org.apache.hadoop.hbase.catalog.MetaEditor]: Updated row usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae. with server=aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
[2014-02-18 01:18:53,644][PostOpenDeployTasks:e8be76322f79998d58750d87f268e0ae][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Done with post open deploy task for region=usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae., daughter=false
[2014-02-18 01:19:10,142][Thread-11][WARN ][org.apache.hadoop.hdfs.DFSClient]: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

    at org.apache.hadoop.ipc.Client.call(Client.java:1225)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at $Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at $Proxy14.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1176)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1029)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
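
By the way, for anyone hitting the same exception, a quick check is whether the WAL file and its parent .logs directory still exist on HDFS at that moment. A minimal sketch of such a check (the WAL path is the one from our log above; "hdfs dfs -test -e" exits with 0 when the path exists):
{code}
# Minimal sketch: check whether a WAL file and its parent directories
# still exist on HDFS. "hdfs dfs -test -e" exits with 0 if the path exists.
# The WAL path below is taken from our log; adjust it for your cluster.
import subprocess

WAL = ("/user/SPN-hbase/.logs/"
       "aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/"
       "aws-scottm-tmh6-1.ap-southeast-1.compute.internal"
       "%2C8120%2C1392703714776.1392703718212")

def exists(path):
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0

for p in (WAL, WAL.rsplit("/", 1)[0], "/user/SPN-hbase/.logs"):
    print("%s -> exists: %s" % (p, exists(p)))
{code}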
Best regards
takeshi
2014-02-19 11:26 GMT+08:00 Ted Yu <[email protected]>:
> Was region server running on 10.147.1.168 ?
>
> If so, can you pastebin region server log around 2014-02-18 01:17:31 ?
>
> Thanks
>
>
> On Tue, Feb 18, 2014 at 9:17 PM, takeshi <[email protected]> wrote:
>
> > Hi, all
> >
> > We also encountered the issue on our POC cluster, which seems like the
> > issue you were talking about. We still have not found the root cause
> > yet, but would like to share some findings here which may help others'
> > study.
> >
> > Environment:
> > We are using cdh-4.2.1 (hadoop-2.0.0, hbase-0.94.2) as our base distro,
> > with some bugfixes ported from other Apache versions, running on 14
> > machines with CentOS-5.3.
> >
> > The error messages we found for the issue:
> >
> > The HBase master reports a "Parent directory doesn't exist" error:
> > {code:title=hbase-master.log}
> > [2014-02-18 01:20:30,550][master-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8100,1392703701863.archivedHFileCleaner][WARN ][org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache]: Snapshot directory: hdfs://cluster1/user/SPN-hbase/.snapshot doesn't exist
> > [2014-02-18 01:20:36,597][IPC Server handler 47 on 8100][ERROR][org.apache.hadoop.hbase.master.HMaster]: Region server ^@^@aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 reported a fatal error:
> > ABORTING region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776: IOE in log roller
> > Cause:
> > java.io.IOException: Exception in createWriter
> >     at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
> >     at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:694)
> >     at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:631)
> >     at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
> >     at java.lang.Thread.run(Thread.java:662)
> > Caused by: java.io.IOException: cannot get log writer
> >     at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:750)
> >     at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
> >     ... 4 more
> > Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:396)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> >
> >     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
> >     at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:747)
> >     ... 5 more
> > {code}
> >
> > The namenode created "/<hbase-root>/.logs/<regionserver>" via mkdirs as
> > requested by principal "hbase/<fqdn>@LAB", but later removed
> > "/<hbase-root>/.logs" as requested by "hdfs/<fqdn>@LAB"!
> > {code:title=namenode.log (with hdfs audit enabled)}
> > 2014-02-18 01:08:38,203 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=mkdirs src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 dst=null perm=hbase:supergroup:rwxr-xr-x
> > ...
> > 2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
> > ...
> > 2014-02-18 01:19:10,132 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
> > 2014-02-18 01:19:10,132 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.147.1.168:35703: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
> > org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
> >     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
> >     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
> >     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
> >     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
> >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
> >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
> >     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:396)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
> >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> > {code}
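> >
> > To make this kind of mismatch easy to spot, one can filter the audit
> > log for operations touching the .logs path and print who did what. A
> > minimal sketch, assuming the key=value audit format shown above (the
> > audit-log path below is hypothetical; point it at your own namenode
> > audit log):
> > {code}
> > # Minimal sketch: scan a namenode audit log for operations on the HBase
> > # .logs directory, printing timestamp, command, and principal.
> > # AUDIT_LOG is a hypothetical path; adjust it for your cluster.
> > import re
> >
> > AUDIT_LOG = "/var/log/hadoop-hdfs/hdfs-audit.log"  # hypothetical
> > TARGET = "/user/SPN-hbase/.logs"
> >
> > pat = re.compile(r"ugi=(\S+).*?cmd=(\S+)\s+src=(\S+)")
> > with open(AUDIT_LOG) as f:
> >     for line in f:
> >         m = pat.search(line)
> >         if m and m.group(3).startswith(TARGET):
> >             # each line starts with "2014-02-18 01:17:31,189 ..."
> >             print("%s %-8s by %s" % (line[:23], m.group(2), m.group(1)))
> > {code}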
> >
> > Way to reproduce:
> > Currently we have a way to reproduce this issue: run the ycsb
> > workload[a-f] against HBase, and at the same time simply execute
> > "service <hadoop,hbase-daemons> start". We keep running both sides 10
> > times in a loop for one round, and the issue can usually be reproduced
> > in the first or second round. But I am not sure whether it is related
> > to the issue talked about at the beginning of this discussion thread.
> > A rough sketch of our loop is below.
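> >
> > {code}
> > # Rough sketch of the reproduction loop described above. The ycsb
> > # command line and service names are illustrative placeholders, not
> > # our exact scripts; substitute your own.
> > import subprocess
> > import time
> >
> > YCSB = ["ycsb", "run", "hbase", "-P", "workloads/workloada"]  # placeholder
> > SERVICES = ["hadoop-hdfs-namenode", "hbase-regionserver"]     # placeholder
> >
> > ycsb = subprocess.Popen(YCSB)  # keep load running in the background
> > try:
> >     for i in range(10):  # one "round" = 10 iterations
> >         for svc in SERVICES:
> >             subprocess.call(["service", svc, "start"])
> >         time.sleep(60)  # let the daemons settle between iterations
> > finally:
> >     ycsb.terminate()
> > {code}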
> >
> >
> > Follow-ups:
> > Now I am following up on the weird message that appeared in the namenode.log
> > {code}
> > 2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
> > {code}
> > to see what more I can dig out...
> >
> >
> > I'll keep updating this thread with any findings, and welcome any
> > opinions. Thanks!
> >
> >
> > Best regards
> >
> > takeshi
> >
>