[ https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764706#comment-16764706 ]

Bahram Chehrazy commented on HBASE-21844:
-----------------------------------------

Also, related to the WAL corruption, I've found several errors like the one below. This 
happened after some transient HDFS issues which led to several servers 
crashing. It seems there is a race condition between two servers trying to 
create the same log file (a minimal sketch of tolerating this on the client side 
follows the log excerpt):

 

2019-02-06 07:58:30,943 WARN [WALProcedureStoreSyncThread] wal.WALProcedureStore: *failed to create log file with id=116*
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to CREATE_FILE /wdphbase/bachehra_Hbase-ingestion-test_30RS_300R-100k-compaction-2/MasterProcWALs/pv2-00000000000000000116.log for DFSClient_NONMAPREDUCE_1940839211_1 on *25.121.235.241* because this file lease is currently owned by DFSClient_NONMAPREDUCE_1385418312_1 on *25.121.241.103*
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2583)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:357)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2431)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileUncheckedMount(FSNamesystem.java:2352)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:748)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:421)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2621)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1507)
    at org.apache.hadoop.ipc.Client.call(Client.java:1453)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy19.create(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
    at sun.reflect.GeneratedMethodAccessor99.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at com.sun.proxy.$Proxy20.create(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor99.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
    at com.sun.proxy.$Proxy21.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:267)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1246)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1188)
    at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:480)
    at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:477)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:491)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:418)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1067)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1048)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:937)
    at org.apache.hadoop.hbase.util.CommonFSUtils$DfsBuilderUtility.createHelper(CommonFSUtils.java:952)
    at org.apache.hadoop.hbase.util.CommonFSUtils.createForWal(CommonFSUtils.java:964)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.rollWriter(WALProcedureStore.java:1074)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.rollWriter(WALProcedureStore.java:1040)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.rollWriterWithRetries(WALProcedureStore.java:952)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.syncSlots(WALProcedureStore.java:908)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.syncLoop(WALProcedureStore.java:859)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.access$000(WALProcedureStore.java:111)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore$1.run(WALProcedureStore.java:314)


2019-02-06 07:58:30,943 WARN [WALProcedureStoreSyncThread] wal.WALProcedureStore: *someone else has already created log 115*
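
For reference, the failure above surfaces on the client as a RemoteException wrapping AlreadyBeingCreatedException, thrown from FileSystem.create(). Below is a minimal, self-contained sketch of how a writer can treat the lost race as non-fatal, in the spirit of the second WARN above ("someone else has already created log 115"). This uses only the public Hadoop FileSystem API, not the actual WALProcedureStore code; the path and the skip-and-roll-forward policy are assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException;
import org.apache.hadoop.ipc.RemoteException;

public class ProcWalCreateSketch {

  /**
   * Try to create a procedure WAL file, treating "another client holds the lease"
   * as a lost race rather than a fatal error. Returns null when another writer
   * already owns the file, so the caller can skip this log id and roll forward.
   */
  static FSDataOutputStream tryCreate(FileSystem fs, Path logFile) throws IOException {
    try {
      // overwrite=false: fail instead of clobbering a file another master created
      return fs.create(logFile, false);
    } catch (AlreadyBeingCreatedException lostRace) {
      return null; // another DFSClient won the create; let the caller move on
    } catch (RemoteException re) {
      IOException unwrapped = re.unwrapRemoteException(AlreadyBeingCreatedException.class);
      if (unwrapped instanceof AlreadyBeingCreatedException) {
        // Same condition as in the log above: the file lease is owned by a
        // different DFSClient (e.g. a second master racing on the same log id).
        return null;
      }
      throw re;
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical path, mirroring the MasterProcWALs layout seen in the log.
    Path log = new Path("/hbase/MasterProcWALs/pv2-00000000000000000116.log");
    FSDataOutputStream out = tryCreate(fs, log);
    System.out.println(out == null ? "lost the race; roll to the next log id" : "created " + log);
    if (out != null) {
      out.close();
    }
  }
}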

> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
>                 Key: HBASE-21844
>                 URL: https://issues.apache.org/jira/browse/HBASE-21844
>             Project: HBase
>          Issue Type: Bug
>          Components: master, meta
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: Bahram Chehrazy
>            Priority: Major
>         Attachments: 
> 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight chance 
> of the master getting into a state where ZK says meta is OPEN, but the server 
> is dead and there is no active SCP to recover it (perhaps the SCP has aborted 
> and the procWALs were corrupted). In this case, waitForMetaOnline never 
> returns.
>  
> We've seen this happen a few times when there had been a temporary HDFS 
> outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/************:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 *state=OPEN*, ts=1547780128227, server=*************,16020,1547776821322}; *ServerCrashProcedures=false*. Master startup cannot progress, in holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, 
> but nevertheless the master should be able to recover during a restart by 
> initiating a new SCP to fix the meta (a sketch of such a startup check follows 
> this description).
>  
>  
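
As a rough illustration of the recovery suggested in the description above, a startup check could detect the bad combination (ZK says meta is OPEN, the hosting server is dead, and no SCP is pending for it) and queue a new ServerCrashProcedure so waitForMetaOnline can eventually return. This is not the attached patch; the MasterView helper methods below are hypothetical placeholders for the corresponding master internals.

import org.apache.hadoop.hbase.ServerName;

public class MetaRecoveryCheckSketch {

  /** Hypothetical view of the master state needed for the check. */
  interface MasterView {
    ServerName metaServerFromZk();                       // server ZK claims hosts hbase:meta
    boolean isOnline(ServerName sn);                     // is that server still alive?
    boolean hasPendingCrashProcedureFor(ServerName sn);  // is an SCP already queued for it?
    void submitServerCrashProcedure(ServerName sn, boolean carryingMeta);
  }

  static void recoverMetaIfNeeded(MasterView master) {
    ServerName metaHost = master.metaServerFromZk();
    if (metaHost == null) {
      return; // meta location unknown; the normal assignment path handles it
    }
    if (!master.isOnline(metaHost) && !master.hasPendingCrashProcedureFor(metaHost)) {
      // The stuck state from this issue: state=OPEN in ZK, dead server, no SCP.
      // Schedule a crash procedure that carries meta so the region is reassigned.
      master.submitServerCrashProcedure(metaHost, true);
    }
  }
}

The attached patch may well take a different route; the sketch only shows the shape of the check, not the real APIs.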


