[ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424695#comment-16424695
 ] 

Appy commented on HBASE-20330:
------------------------------

Hard to reproduce problem.
 Our hypothesis was, if a master during startup tries to create new proc-wal, 
and fails while writing header (for 19.log in this case) 
[here|https://github.com/apache/hbase/blob/d9e64aa6b83fb6ed5230b0fde86fdf8d8732e1a4/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java#L1041],
 then it'll get stuck in infinite loop.
 Exact seq of steps.
 # rollWriter fails to write header (for 19.log), returns false.
 # we 'continue' the loop 
[here|https://github.com/apache/hbase/blob/d9e64aa6b83fb6ed5230b0fde86fdf8d8732e1a4/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/store/wal/WALProcedureStore.java#L377]
 # isRunning() will always be true, list of oldLogs is not refreshed (it's 
still till 18.log)
 # renew leases on all existing log files till 18, try to create 19
 # since it was already created last time, rollWriter returns false and we 
'continue' from step 2 again.

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---------------------------------------------------------------
>
>                 Key: HBASE-20330
>                 URL: https://issues.apache.org/jira/browse/HBASE-20330
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.0.0-beta-2
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>            Priority: Major
>             Fix For: 2.0.0
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-00000000000000000019.log for client 10.17.202.11 
> already exists
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy] and [~avirmani], we found that when 
> WALProcedureStore#rollWriter() fails and returns false for some reason, it 
> keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to