[
https://issues.apache.org/jira/browse/HDFS-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941211#comment-16941211
]
leigh edited comment on HDFS-14498 at 10/1/19 7:59 AM:
-------------------------------------------------------
Hi,
We also encountered this issue today:
Hadoop 3.2.0
Source code repository [https://github.com/apache/hadoop.git] -r
e97acb3bd8f3befd27418996fa5d4b50bf2e17bf
Compiled by sunilg on 2019-01-08T06:08Z
Compiled with protoc 2.5.0
From source with checksum d3f0795ed0d9dc378e2c785d3668f39
We did have an issue in our cluster before this happened, where some of our DNs
stopped sharing heartbeats with the active NN (although heartbeats with the
standby NNs could be seen).
I ran the recoverLease command on the bad files but that did not help.
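For reference, lease recovery on a single file can be requested from the command line with `hdfs debug recoverLease -path <path> [-retries <n>]`. The following is only a minimal sketch of scripting that over several files, assuming the `hdfs` CLI is on the PATH; the helper names and the default retry count are hypothetical and not taken from this report:

```python
import subprocess


def build_recover_lease_cmd(path, retries=3):
    """Build the argv for `hdfs debug recoverLease` on one HDFS file."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return ["hdfs", "debug", "recoverLease",
            "-path", path, "-retries", str(retries)]


def recover_leases(paths, retries=3):
    """Run lease recovery for each path and collect exit code and output.

    Assumes the `hdfs` binary is installed and configured for the cluster.
    """
    results = {}
    for path in paths:
        proc = subprocess.run(build_recover_lease_cmd(path, retries),
                              capture_output=True, text=True)
        results[path] = (proc.returncode, proc.stdout.strip())
    return results
```

Keeping the command construction separate from the execution makes it possible to print and review the commands before running them against a production cluster.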
I restarted the NNs and all the DNs. That stopped the log spam, but
we were still unable to write to the affected files. In the end I had to delete
the bad files.
We have a sizeable cluster. Is there anything in particular you would like to
see from the logs?
Thanks in advance.
> LeaseManager can loop forever on the file for which create has failed
> ----------------------------------------------------------------------
>
> Key: HDFS-14498
> URL: https://issues.apache.org/jira/browse/HDFS-14498
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.0
> Reporter: Sergey Shelukhin
> Priority: Major
>
> The logs from the file creation are long gone due to the infinite lease
> logging, but the create presumably failed; the client that was trying to
> write this file is definitely long dead.
> The version includes HDFS-4882.
> We get this log pattern repeating infinitely:
> {noformat}
> 2019-05-16 14:00:16,893 INFO
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f]
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: [Lease. Holder:
> DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1] has expired hard
> limit
> 2019-05-16 14:00:16,893 INFO
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f]
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1], src=<snip>
> 2019-05-16 14:00:16,893 WARN
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f]
> org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease:
> Failed to release lease for file <snip>. Committed blocks are waiting to be
> minimally replicated. Try again later.
> 2019-05-16 14:00:16,893 WARN
> [org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@b27557f]
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Cannot release the path
> <snip> in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-20898906_61,
> pending creates: 1]. It will be retried.
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR*
> NameSystem.internalReleaseLease: Failed to release lease for file <snip>.
> Committed blocks are waiting to be minimally replicated. Try again later.
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.internalReleaseLease(FSNamesystem.java:3357)
> at org.apache.hadoop.hdfs.server.namenode.LeaseManager.checkLeases(LeaseManager.java:573)
> at org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor.run(LeaseManager.java:509)
> at java.lang.Thread.run(Thread.java:745)
> $ grep -c "Recovering.*DFSClient_NONMAPREDUCE_-20898906_61, pending creates: 1" hdfs_nn*
> hdfs_nn.log:1068035
> hdfs_nn.log.2019-05-16-14:1516179
> hdfs_nn.log.2019-05-16-15:1538350
> {noformat}
> Aside from an actual bug fix, it might make sense to make LeaseManager not
> log so much, in case there are more bugs like this...
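
The `grep -c` count in the description covers one known lease holder. When the holder is not known in advance, the same "Recovering [Lease. Holder: ...]" lines can be tallied per holder to see which lease is looping. This is only a rough sketch: the regular expression is inferred from the log excerpt quoted above, the function names are hypothetical, and it assumes each log record sits on one line in the real NameNode log (the excerpt above is wrapped by the mail client).

```python
import re
from collections import Counter

# Pattern inferred from the quoted FSNamesystem log line, for example:
#   ... FSNamesystem: Recovering [Lease. Holder: <holder>, pending creates: 1], src=...
RECOVERING = re.compile(r"Recovering \[Lease\.\s+Holder:\s+([^,\]\s]+),")


def count_recoveries(lines):
    """Count 'Recovering [Lease...' records per lease holder."""
    counts = Counter()
    for line in lines:
        match = RECOVERING.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def count_recoveries_in_files(paths):
    """Aggregate per-holder counts over several NameNode log files."""
    total = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            total.update(count_recoveries(handle))
    return total
```

Calling `count_recoveries_in_files(...).most_common(5)` on the NameNode logs would list the holders with the most recovery attempts, which is usually enough to tell a stuck lease apart from normal lease expiry.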