[jira] [Comment Edited] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052019#comment-17052019 ]

Anes Mukhametov edited comment on HDFS-14498 at 3/5/20, 11:13 AM:
------------------------------------------------------------------

Got the same issue with CDH-5.16.2. It also seems to have happened after a client died while writing, more than a week ago; as a result, the LeaseManager seems stuck on this lease, with no other leases being recovered.

All logs have already rolled, so I can't give any additional information. I'm also unable to provide any jstack output: I can't pause a production namenode for that long (creating a stack dump takes more than 5 minutes).

{quote}
2020-03-05 13:29:33,846 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1], src=/tmp/57-1582557184-0.tmp
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path /tmp/57-1582557184-0.tmp in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:4889)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:605)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:541)
        at java.lang.Thread.run(Thread.java:748)
{quote}

was (Author: amuhametov):
Got the same issue with CDH-5.16.2 {{It also seems to happen after a client died while writing, more than a week ago. As a result LockManager seems stuck on this lease with no other leases being recovered.}} All logs already rolled, so I can't give any additional info. Also I'm unable to provide any jstack information, can't pause production namenode for such a long period of time (it takes more than 5 minutes to create stack dump).

{quote}
2020-03-05 13:29:33,846 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1], src=/tmp/57-1582557184-0.tmp
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path /tmp/57-1582557184-0.tmp in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:4889)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:605)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:541)
        at java.lang.Thread.run(Thread.java:748)
{quote}

> LeaseManager can loop forever on the file for which create has failed
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14498
>                 URL: https://issues.apache.org/jira/browse/HDFS-14498
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> The logs from file creation are long gone due to infinite lease logging;
> however, it presumably failed... the client who was trying to write this file
> is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard limit
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path  in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>         at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>         at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>         at java.lang.Thread.run(Thread.java:745)
> $ grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not
> log so much, in case there are more bugs like this...
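The "no other leases being recovered" symptom reported above can be modeled as head-of-line blocking: if expired leases are processed oldest-first and a lease that can never be released keeps re-entering the head of the queue, the leases behind it are starved. A minimal, self-contained sketch of that behavior follows; the queue discipline, the one-attempt-per-pass budget, and all names are assumptions for illustration, not the actual LeaseManager scheduling code.

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

public class LeaseStarvationSketch {

    // Hypothetical lease record: path, expiry used for ordering, and whether
    // recovery can ever succeed (false models the stuck /tmp file above).
    static final class Lease {
        final String path;
        final long expiryMillis;
        final boolean recoverable;
        Lease(String path, long expiryMillis, boolean recoverable) {
            this.path = path;
            this.expiryMillis = expiryMillis;
            this.recoverable = recoverable;
        }
    }

    public static void main(String[] args) {
        // Oldest-expiry-first queue of expired leases awaiting recovery.
        PriorityQueue<Lease> expired = new PriorityQueue<>(
            Comparator.comparingLong((Lease l) -> l.expiryMillis));
        expired.add(new Lease("/tmp/57-1582557184-0.tmp", 1L, false)); // never recoverable
        expired.add(new Lease("/data/app/part-0001", 2L, true));       // hypothetical paths
        expired.add(new Lease("/data/app/part-0002", 3L, true));

        // Assumed budget: one recovery attempt per monitor pass. The
        // unrecoverable lease is re-queued with the oldest expiry, so it is
        // always retried first and everything behind it starves.
        for (int pass = 1; pass <= 4; pass++) {
            Lease head = expired.poll();
            if (head == null) break;
            if (head.recoverable) {
                System.out.println("pass " + pass + ": recovered " + head.path);
            } else {
                System.out.println("pass " + pass + ": retrying " + head.path);
                expired.add(head); // back at the head of the queue
            }
        }
        // Prints "retrying /tmp/57-1582557184-0.tmp" on every pass;
        // the two recoverable leases are never reached.
    }
}
{code}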
[jira] [Commented] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed
[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052019#comment-17052019 ]

Anes Mukhametov commented on HDFS-14498:
-----------------------------------------

Got the same issue with CDH-5.16.2 {{It also seems to happen after a client died while writing, more than a week ago. As a result LockManager seems stuck on this lease with no other leases being recovered.}} All logs already rolled, so I can't give any additional info. Also I'm unable to provide any jstack information, can't pause production namenode for such a long period of time (it takes more than 5 minutes to create stack dump).

{quote}
2020-03-05 13:29:33,846 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1], src=/tmp/57-1582557184-0.tmp
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path /tmp/57-1582557184-0.tmp in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:4889)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:605)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:541)
        at java.lang.Thread.run(Thread.java:748)
{quote}

> LeaseManager can loop forever on the file for which create has failed
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14498
>                 URL: https://issues.apache.org/jira/browse/HDFS-14498
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.0
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> The logs from file creation are long gone due to infinite lease logging;
> however, it presumably failed... the client who was trying to write this file
> is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard limit
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path  in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>         at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>         at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>         at java.lang.Thread.run(Thread.java:745)
> $ grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not
> log so much, in case there are more bugs like this...
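The grep counts above (over a million "Recovering" lines per log file) show the monitor retrying the same lease on every pass. A minimal sketch of that retry pattern, with hypothetical stand-ins for the monitor loop and internalReleaseLease(), is below; it illustrates the failure mode only and is not the Hadoop source.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class LeaseLoopSketch {

    // Stand-in for org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException.
    static class AlreadyBeingCreatedException extends Exception {
        AlreadyBeingCreatedException(String msg) { super(msg); }
    }

    // Stand-in for FSNamesystem.internalReleaseLease(). For a file whose
    // committed block can never reach minimal replication (e.g. the create
    // itself failed), this throws on every call.
    static boolean internalReleaseLease(String path)
            throws AlreadyBeingCreatedException {
        throw new AlreadyBeingCreatedException(
            "Failed to release lease for file " + path
            + ". Committed blocks are waiting to be minimally replicated."
            + " Try again later.");
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> expiredLeases = new ArrayDeque<>();
        expiredLeases.add("/tmp/57-1582557184-0.tmp");

        // The real Monitor thread loops until shutdown; capped at 5 passes
        // here so the sketch terminates. The key point: on failure the lease
        // stays queued and there is no retry limit or back-off.
        for (int pass = 0; pass < 5 && !expiredLeases.isEmpty(); pass++) {
            String path = expiredLeases.peek();
            try {
                if (internalReleaseLease(path)) {
                    expiredLeases.remove(); // recovered: lease removed
                }
            } catch (AlreadyBeingCreatedException e) {
                // Mirrors the repeating WARN line from the NameNode logs.
                System.out.println("Cannot release the path " + path
                    + ". It will be retried. (" + e.getMessage() + ")");
            }
            Thread.sleep(100); // the real monitor sleeps between passes
        }
    }
}
{code}

Because nothing ever removes the lease or caps the retries, the same WARN is emitted on every pass, which is exactly the log flood the reporter suggests throttling.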
[jira] [Issue Comment Deleted] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anes Mukhametov updated HDFS-7060:
----------------------------------
    Comment: was deleted

(was: got the same issue with cdh 5.16 (hadoop 2.6/2.7))

> Avoid taking locks when sending heartbeats from the DataNode
> -------------------------------------------------------------
>
>                 Key: HDFS-7060
>                 URL: https://issues.apache.org/jira/browse/HDFS-7060
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haohui Mai
>            Assignee: Jiandan Yang
>            Priority: Major
>              Labels: BB2015-05-TBR, locks, performance
>             Fix For: 3.0.0, 3.1.0
>
>         Attachments: HDFS Status Post Patch.png, HDFS-7060-002.patch, HDFS-7060.000.patch, HDFS-7060.001.patch, HDFS-7060.003.patch, HDFS-7060.004.patch, HDFS-7060.005.patch, complete_failed_qps.png, sendHeartbeat.png
>
>
> We're seeing that the heartbeat is blocked by the monitor of {{FsDatasetImpl}} when the DN is under a heavy write load:
> {noformat}
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
>         - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
>         - locked <0x000780612fd8> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
>         at java.lang.Thread.run(Thread.java:744)
>
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
>         - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:744)
>
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1006)
>         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
>         - locked <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
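The three traces above form a classic pattern: the heartbeat thread is a cheap reader blocked behind a writer that holds the same object monitor across disk I/O. A minimal, self-contained sketch of that interaction follows; the class, method names, and sleep times are illustrative stand-ins, not the FsDatasetImpl code.

{code:java}
public class HeartbeatBlockingSketch {

    // Plays the role of the FsDatasetImpl object monitor from the traces above.
    private final Object datasetLock = new Object();

    // Heartbeat path (cf. FsVolumeImpl.getDfsUsed): a cheap read, but it has
    // to win the same monitor the writer holds.
    long getDfsUsed() {
        synchronized (datasetLock) {
            return 42L;
        }
    }

    // Writer path (cf. FsDatasetImpl.createRbw): holds the monitor across
    // file-system work; a slow disk turns this into seconds.
    void createRbw() throws InterruptedException {
        synchronized (datasetLock) {
            Thread.sleep(3_000); // stands in for createTmpFile() on a busy disk
        }
    }

    public static void main(String[] args) throws Exception {
        HeartbeatBlockingSketch ds = new HeartbeatBlockingSketch();
        Thread writer = new Thread(() -> {
            try { ds.createRbw(); } catch (InterruptedException ignored) { }
        });
        writer.start();
        Thread.sleep(100); // let the writer take the lock first

        long t0 = System.nanoTime();
        ds.getDfsUsed(); // BLOCKED (on object monitor), like sendHeartBeat above
        System.out.printf("heartbeat read stalled for %.1f s%n",
            (System.nanoTime() - t0) / 1e9);
    }
}
{code}

Running it prints a stall of roughly the writer's hold time, matching the BLOCKED heartbeat thread in the jstack output.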
[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode
[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877257#comment-16877257 ]

Anes Mukhametov commented on HDFS-7060:
----------------------------------------

got the same issue with cdh 5.16 (hadoop 2.6/2.7)

> Avoid taking locks when sending heartbeats from the DataNode
> -------------------------------------------------------------
>
>                 Key: HDFS-7060
>                 URL: https://issues.apache.org/jira/browse/HDFS-7060
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haohui Mai
>            Assignee: Jiandan Yang
>            Priority: Major
>              Labels: BB2015-05-TBR, locks, performance
>             Fix For: 3.0.0, 3.1.0
>
>         Attachments: HDFS Status Post Patch.png, HDFS-7060-002.patch, HDFS-7060.000.patch, HDFS-7060.001.patch, HDFS-7060.003.patch, HDFS-7060.004.patch, HDFS-7060.005.patch, complete_failed_qps.png, sendHeartbeat.png
>
>
> We're seeing that the heartbeat is blocked by the monitor of {{FsDatasetImpl}} when the DN is under a heavy write load:
> {noformat}
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
>         - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
>         - locked <0x000780612fd8> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
>         at java.lang.Thread.run(Thread.java:744)
>
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
>         - waiting to lock <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:744)
>
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1006)
>         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
>         - locked <0x000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
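The remedy the issue title points at is to let the heartbeat read its numbers without taking the dataset-wide monitor, for example by keeping usage in an atomic counter that the write path updates as it goes. The sketch below illustrates that idea only; it is not the committed HDFS-7060 patch, and the method names are hypothetical.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

public class LockFreeUsageSketch {

    // Usage counter the writer updates in place of a dataset-wide monitor.
    private final AtomicLong dfsUsed = new AtomicLong();

    // Writer path: bump the counter without holding any shared lock.
    void onBlockFinalized(long bytes) {
        dfsUsed.addAndGet(bytes);
    }

    // Heartbeat path: a plain atomic read that can never block behind a
    // writer doing disk I/O.
    long getDfsUsed() {
        return dfsUsed.get();
    }

    public static void main(String[] args) {
        LockFreeUsageSketch volume = new LockFreeUsageSketch();
        volume.onBlockFinalized(128L * 1024 * 1024);
        System.out.println("dfsUsed=" + volume.getDfsUsed()); // 134217728
    }
}
{code}

The trade-off is that the heartbeat may report a value that is a moment stale, which is acceptable for capacity reporting and removes the heartbeat from the writer's critical section entirely.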