[
https://issues.apache.org/jira/browse/HDFS-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506036#comment-13506036
]
Daryn Sharp commented on HDFS-4232:
-----------------------------------
Ok, I need some input here. I'm attacking the symptom of a problem seen on a
production cluster, which hit a series of complex issues that we can't piece
together. Long story short, we wound up with leases to non-existent files and
the NN couldn't start. We believed the dangling leases were the result of
deletions, but the cause is ambiguous.
My original intent was to make dangling leases non-fatal. It's arguably
reasonable because nothing will be lost that isn't already lost (the file).
However, that masks whatever bug may have caused the issue in the first place.
# Looking at the code, the inodes are deleted, then the corresponding leases
are deleted, and then the edit log is synced outside the write lock so perhaps
it's plausible that the edits were not atomically written? I don't think I can
write a test that might induce that possible bug.
# The other possibility is that {{LeaseManager#removeLeaseWithPathPrefix}} is
malfunctioning. It takes a {{SortedSet}}, makes another {{SortedSet}} backed
by the former set, makes a {{List}} of {{Map.Entry}} from the set of a set,
then iterates over the list and performs operations that will affect the
original set. Now the javadocs for {{Map.Entry}} state:
bq. Map.Entry objects are valid only for the duration of the iteration; more
formally, the behavior of a map entry is undefined if the backing map has been
modified after the entry was returned by the iterator, except through the
setValue operation on the map entry.
It sounds like #2 is relying on undefined behavior twofold: using the entries
after the iteration is complete, and using the entries while modifying their
set. I can easily fix that but I can't write a test that can reliably induce
the former undefined behavior to occur.
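To make the second possibility concrete, here is a minimal sketch of a prefix removal that never holds {{Map.Entry}} objects across a modification. It is not the actual {{LeaseManager}} code: the class name, method names, the path-to-holder {{SortedMap<String, String>}}, and the "/" separator check are all assumptions made for illustration. The idea is to copy only the immutable {{String}} keys out of the live {{tailMap}} view, finish iterating, and only then mutate the original map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for the lease bookkeeping: path -> lease holder.
class LeasePrefixRemoval {

    // Collect the paths at or under 'prefix'. Only String keys are copied,
    // so nothing returned here depends on Map.Entry objects staying valid.
    static List<String> pathsWithPrefix(SortedMap<String, String> leasesByPath,
                                        String prefix) {
        List<String> paths = new ArrayList<>();
        // tailMap is a live view backed by leasesByPath; we only read from it.
        for (String path : leasesByPath.tailMap(prefix).keySet()) {
            if (!path.startsWith(prefix)) {
                break; // keys are sorted, so we are past the prefix range
            }
            // Assumed rule: match the path itself or anything below it.
            if (path.length() == prefix.length()
                    || path.charAt(prefix.length()) == '/') {
                paths.add(path);
            }
        }
        return paths;
    }

    // Remove every lease at or under 'prefix'; returns how many were removed.
    static int removeLeasesWithPrefix(SortedMap<String, String> leasesByPath,
                                      String prefix) {
        // Iteration over the view is finished before the map is modified.
        List<String> doomed = pathsWithPrefix(leasesByPath, prefix);
        for (String path : doomed) {
            leasesByPath.remove(path);
        }
        return doomed.size();
    }

    public static void main(String[] args) {
        SortedMap<String, String> leases = new TreeMap<>();
        leases.put("/a/b", "client1");
        leases.put("/a/b/c", "client2");
        leases.put("/a/c", "client3");
        int removed = removeLeasesWithPrefix(leases, "/a/b");
        System.out.println(removed + " removed, remaining: " + leases.keySet());
    }
}
```

This only shows how to stay within the documented {{Map.Entry}} contract; whether the original code's behavior actually produced the dangling leases is still unknown.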
So I'm at the crossroads of:
# As a safety measure, make dangling leases non-fatal
# Fix {{removeLeaseWithPathPrefix}} to avoid undefined behavior, and hope it
solves the problem
# 1+2
Advice?
> NN fails to write a fsimage with stale leases
> ---------------------------------------------
>
> Key: HDFS-4232
> URL: https://issues.apache.org/jira/browse/HDFS-4232
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.23.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
> Priority: Blocker
>
> The reading of a fsimage will ignore leases for non-existent files, but the
> writing of an image will fail if there are leases for non-existent files. If
> the image contains leases that reference a non-existent file, then the NN
> will fail to start, and the 2NN will start but fail to ever write an image.