[jira] [Commented] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

Umesh Agashe (JIRA) Fri, 06 Apr 2018 14:22:35 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429024#comment-16429024
 ]


Umesh Agashe commented on HBASE-20330:
--------------------------------------

bq. The preferred idiom here is to use a final AtomicBoolean as a holder 
instead of a boolean array. It's a little bit more explicit about what is 
happening.
Intellij suggests using an array and IIRC existing code has used arrays at 
places. If you can add more context to preferred idiom I will be happy to 
change it. I think AtomicBoolean when atomicity is not required can be more 
confusing.

bq. Was the rollWriterForTesting method that you found insufficient for your 
needs? confused about the difference and why we can't use that.

recoverLease method calls rollWriter() and when it fails it was continuously 
looping. Stubbing rollWriterForTesting will not help here.

> ProcedureExecutor.start() gets stuck in recover lease on store.
> ---------------------------------------------------------------
>
>                 Key: HBASE-20330
>                 URL: https://issues.apache.org/jira/browse/HBASE-20330
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.0.0-beta-2
>            Reporter: Umesh Agashe
>            Assignee: Umesh Agashe
>            Priority: Major
>             Fix For: 2.0.0
>
>         Attachments: hbase-20330.master.001.patch, 
> hbase-20330.master.002.patch, hbase-20330.master.003.patch, 
> hbase-20330.master.004.patch
>
>
> We have instance in our internal testing where master log is getting filled 
> with following messages:
> {code}
> 2018-04-02 17:11:17,566 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recover lease on dfs file 
> hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log
> 2018-04-02 17:11:17,567 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovered lease, attempt=0 on 
> file=hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log after 1ms
> 2018-04-02 17:11:17,574 WARN 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Unable to 
> read tracker for hdfs://ns1/hbase/MasterProcWALs/pv2-00000000000000000018.log 
> - Invalid Trailer version. got 111 expected 1
> 2018-04-02 17:11:17,576 ERROR 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore: Log file with 
> id=19 already exists
> org.apache.hadoop.fs.FileAlreadyExistsException: 
> /hbase/MasterProcWALs/pv2-00000000000000000019.log for client 10.17.202.11 
> already exists
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:381)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2442)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Debugging it further with [~appy], [~avirmani] and [~xiaochen] we found that 
> when WALProcedureStore#rollWriter() fails and returns false for some reason, 
> it keeps looping continuously.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-20330) ProcedureExecutor.start() gets stuck in recover lease on store.

Reply via email to