[jira] [Comment Edited] (HDFS-14498) LeaseManager can loop forever on the file for which create has failed

2020-03-05 Thread Anes Mukhametov (Jira)


[ https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052019#comment-17052019 ]

Anes Mukhametov edited comment on HDFS-14498 at 3/5/20, 11:13 AM:
------------------------------------------------------------------

Got the same issue with CDH 5.16.2.

It also seems to have happened after a client died while writing, more than a week ago. As a result, the {{LeaseManager}} seems stuck on this lease, with no other leases being recovered.

All logs have already rolled, so I can't give any additional info. I'm also unable to provide any jstack output, since I can't pause a production NameNode for that long (it takes more than 5 minutes to create a stack dump).

 
{noformat}
2020-03-05 13:29:33,846 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1], src=/tmp/57-1582557184-0.tmp
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path /tmp/57-1582557184-0.tmp in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:4889)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:605)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:541)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
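
For reference, the "It will be retried" WARN comes from the lease monitor loop itself. Below is a minimal Java sketch of that behavior as the stack trace above suggests it; the names ({{expiredHardLimitLeases}}, {{fsnamesystem}}, {{newHolder}}) are invented for illustration, and this is not the actual Hadoop source:

{code:java}
// Illustrative sketch only: invented names, not the actual Hadoop source.
// The monitor walks leases past their hard limit and tries to close each
// file. internalReleaseLease throws AlreadyBeingCreatedException while the
// committed block is not yet minimally replicated; the exception is caught,
// only a WARN is logged, and the same lease is examined again on the next
// pass of the monitor.
for (Lease lease : expiredHardLimitLeases()) {
  for (String path : lease.getPaths()) {
    try {
      fsnamesystem.internalReleaseLease(lease, path, null, newHolder);
    } catch (IOException e) {
      LOG.warn("Cannot release the path " + path + " in the lease "
          + lease + ". It will be retried.", e);
      // No retry cap, no backoff, and nothing removes the lease: a block
      // that can never reach minimal replication repeats this forever.
    }
  }
}
{code}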
 

 


was (Author: amuhametov):
Got the same issue with CDH 5.16.2.

It also seems to have happened after a client died while writing, more than a week ago. As a result, the {{LockManager}} seems stuck on this lease, with no other leases being recovered.

All logs have already rolled, so I can't give any additional info. I'm also unable to provide any jstack output, since I can't pause a production NameNode for that long (it takes more than 5 minutes to create a stack dump).

 
{noformat}
2020-03-05 13:29:33,846 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1], src=/tmp/57-1582557184-0.tmp
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
2020-03-05 13:29:33,846 WARN org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path /tmp/57-1582557184-0.tmp in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-1052192603_27, pending creates: 1]. It will be retried.
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /tmp/57-1582557184-0.tmp. Committed blocks are waiting to be minimally replicated. Try again later.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:4889)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:605)
        at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:541)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
 

 

> LeaseManager can loop forever on the file for which create has failed
> ----------------------------------------------------------------------
>
> Key: HDFS-14498
> URL: https://issues.apache.org/jira/browse/HDFS-14498
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.0
> Reporter: Sergey Shelukhin
> Priority: Major
>
> The logs from the file creation are long gone due to the infinite lease 
> logging; presumably the create failed... the client that was trying to write 
> this file is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease.  Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard limit
> 2019-05-16 14:00:16,893 INFO [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f] org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path  in the lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file . Committed blocks are waiting to be minimally replicated. Try again later.
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
>   at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
>   at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
>   at java.lang.Thread.run(Thread.java:745)
> $ grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not 
> log so much, in case there are more bugs like this...
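
The rate-limiting idea at the end of the description could look roughly like the sketch below: one timestamp per lease holder, with the recovery WARN emitted at most once per window. The class name and the 60-second window are invented for illustration; this is not an actual Hadoop patch.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical per-lease-holder log throttle (invented name, assumed
// 60 s window); not an actual Hadoop patch.
final class ThrottledLeaseLog {
  private static final long WINDOW_MS = 60_000;
  private final ConcurrentMap<String, Long> lastLogged = new ConcurrentHashMap<>();

  /** Returns true at most once per window for the same lease holder. */
  boolean shouldLog(String holder) {
    final long now = System.currentTimeMillis();
    final Long prev = lastLogged.putIfAbsent(holder, now);
    if (prev == null) {
      return true;  // first occurrence of this holder: log it
    }
    // Log again only when the window has elapsed and we win the race
    // to bump the timestamp.
    return now - prev >= WINDOW_MS && lastLogged.replace(holder, prev, now);
  }
}
{code}

Guarding the "Cannot release the path" WARN with shouldLog(holder) would turn millions of lines per hour into one line per window per stuck lease.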

[jira] [Issue Comment Deleted] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2019-07-02 Thread Anes Mukhametov (JIRA)


 [ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anes Mukhametov updated HDFS-7060:
----------------------------------
Comment: was deleted

(was: Got the same issue with CDH 5.16 (Hadoop 2.6/2.7))

> Avoid taking locks when sending heartbeats from the DataNode
> ------------------------------------------------------------
>
> Key: HDFS-7060
> URL: https://issues.apache.org/jira/browse/HDFS-7060
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Haohui Mai
> Assignee: Jiandan Yang
> Priority: Major
> Labels: BB2015-05-TBR, locks, performance
> Fix For: 3.0.0, 3.1.0
>
> Attachments: HDFS Status Post Patch.png, HDFS-7060-002.patch, 
> HDFS-7060.000.patch, HDFS-7060.001.patch, HDFS-7060.003.patch, 
> HDFS-7060.004.patch, HDFS-7060.005.patch, complete_failed_qps.png, 
> sendHeartbeat.png
>
>
> We're seeing that the heartbeat is blocked by the monitor of {{FsDatasetImpl}} 
> when the DN is under a heavy write load:
> {noformat}
>    java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
> - locked <0x000780612fd8> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
> at java.lang.Thread.run(Thread.java:744)
>    java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
> - waiting to lock <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
>    java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
> - locked <0x000780304fb8> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
> at java.lang.Thread.run(Thread.java:744)
> {noformat}
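
The pattern in these traces is one coarse monitor shared between slow write-path I/O and the read-only heartbeat report. Below is a self-contained Java sketch of the problem, plus one lock-free shape a remedy can take; the classes are hypothetical and this is not Hadoop code:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the contention in the traces above; not Hadoop code.
public class HeartbeatLockDemo {
    // Coarse locking as in the traces: one monitor guards both slow
    // replica creation and the cheap storage report.
    static class CoarseDataset {
        private long bytesUsed;
        synchronized void createReplica() throws InterruptedException {
            Thread.sleep(200);       // stands in for slow on-disk file creation
            bytesUsed++;
        }
        synchronized long storageReport() {  // heartbeat blocks here
            return bytesUsed;
        }
    }

    // One remedy in the spirit of this issue: keep the counters in atomics
    // so the report path never takes the dataset monitor at all.
    static class LockFreeReportDataset {
        private final AtomicLong bytesUsed = new AtomicLong();
        void createReplica() throws InterruptedException {
            Thread.sleep(200);       // slow I/O without holding the monitor
            bytesUsed.incrementAndGet();
        }
        long storageReport() {       // never waits behind a writer
            return bytesUsed.get();
        }
    }

    public static void main(String[] args) throws Exception {
        CoarseDataset ds = new CoarseDataset();
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) ds.createReplica();
            } catch (InterruptedException ignored) { }
        });
        writer.start();
        Thread.sleep(50);            // let the writer grab the monitor first
        long t0 = System.nanoTime();
        ds.storageReport();          // stalls until the writer releases the lock
        System.out.printf("heartbeat waited %d ms%n",
                (System.nanoTime() - t0) / 1_000_000);
        writer.join();
    }
}
{code}

Moving the storage-report path off the dataset monitor, as in the second class, is roughly the direction the issue title points: sendHeartBeat should never have to queue behind a replica-creating writer.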






[jira] [Commented] (HDFS-7060) Avoid taking locks when sending heartbeats from the DataNode

2019-07-02 Thread Anes Mukhametov (JIRA)


[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16877257#comment-16877257 ]

Anes Mukhametov commented on HDFS-7060:
----------------------------------------

Got the same issue with CDH 5.16 (Hadoop 2.6/2.7).
