[ 
https://issues.apache.org/jira/browse/HDFS-4232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506036#comment-13506036
 ] 

Daryn Sharp commented on HDFS-4232:
-----------------------------------

Ok, I need some input here.  I'm attacking the symptom of a problem seen on a 
production cluster that hit a series of complex issues that we can't piece 
together.  Long story short, we wound up with leases to non-existent files and 
the NN couldn't start.  We believed the dangling leases were the result of 
deletions, but the cause is ambiguous.

My original intent was to make dangling leases non-fatal.  It's arguably 
reasonable because nothing will be lost that isn't already lost (the file). 
However, that masks whatever bug may have caused the issue in the first place.

# Looking at the code, the inodes are deleted, then the corresponding leases 
are deleted, and then the edit log is synced outside the write lock so perhaps 
it's plausible that the edits were not atomically written?  I don't think I can 
write a test that might induce that possible bug.
# The other possibility is that {{LeaseManager#removeLeaseWithPrefix}} is 
malfunctioning.  It takes a {{SortedSet}}, makes another {{SortedSet}} backed 
by the former set, makes a {{List}} of {{Map.Entry}} from the set of a set, 
then iterates over the list and performs operations that will affect the 
original set.  Now the javadocs for {{Map.Entry}} state:

bq. Map.Entry objects are valid only for the duration of the iteration; more 
formally, the behavior of a map entry is undefined if the backing map has been 
modified after the entry was returned by the iterator, except through the 
setValue operation on the map entry.

It sounds like #2 is relying on undefined behavior twofold: using the entries 
after the iteration is complete, and using the entries while modifying their 
set.  I can easily fix that but I can't write a test that can reliably induce 
the former undefined behavior to occur.
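
For illustration only, here is a minimal sketch of the kind of fix I mean: snapshot plain keys instead of holding {{Map.Entry}} objects, and only mutate the map after the iteration over the view has finished.  This is not the actual {{LeaseManager}} code.  The class name {{LeasePrefixSketch}}, the method {{removeWithPrefix}}, and the use of a {{SortedMap<String, String>}} (path to lease holder) are all hypothetical stand-ins, and the prefix is assumed to have no trailing slash.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for the lease bookkeeping; not the real LeaseManager.
public class LeasePrefixSketch {

    /**
     * Removes every entry whose path equals the prefix or lies beneath it,
     * and returns the removed paths in sorted order.
     * Assumption: prefix has no trailing '/'.
     */
    static List<String> removeWithPrefix(SortedMap<String, String> leasesByPath,
                                         String prefix) {
        // Phase 1: copy the matching keys. Strings stay valid after the
        // backing map changes, unlike Map.Entry objects.
        List<String> doomed = new ArrayList<>();
        for (String path : leasesByPath.tailMap(prefix).keySet()) {
            if (!path.startsWith(prefix)) {
                break; // keys are sorted, so we are past the prefix range
            }
            // Match the path itself or a child such as prefix + "/x",
            // but not a sibling such as prefix + "-x".
            if (path.length() == prefix.length()
                    || path.charAt(prefix.length()) == '/') {
                doomed.add(path);
            }
        }
        // Phase 2: mutate only after iteration over the view is complete.
        for (String path : doomed) {
            leasesByPath.remove(path);
        }
        return doomed;
    }

    public static void main(String[] args) {
        SortedMap<String, String> leases = new TreeMap<>();
        leases.put("/a/b", "client1");
        leases.put("/a/b-x", "client2");
        leases.put("/a/b/c", "client3");
        leases.put("/a/c", "client4");
        System.out.println("removed:   " + removeWithPrefix(leases, "/a/b"));
        System.out.println("remaining: " + leases.keySet());
    }
}
```

The two-phase shape is the whole point: nothing obtained from the view's iterator is touched once the map starts changing, so the sketch stays within what the {{Map.Entry}} javadoc guarantees.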

So I'm at the crossroads of:
# As a safety measure, make dangling leases non-fatal
# Fix {{removeLeaseWithPrefix}} to avoid undefined behavior, and hope it solves 
the problem
# 1+2

Advice?
                
> NN fails to write a fsimage with stale leases
> ---------------------------------------------
>
>                 Key: HDFS-4232
>                 URL: https://issues.apache.org/jira/browse/HDFS-4232
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Blocker
>
> The reading of a fsimage will ignore leases for non-existent files, but the 
> writing of an image will fail if there are leases for non-existent files.  If 
> the image contains leases that reference a non-existent file, then the NN 
> will fail to start, and the 2NN will start but fail to ever write an image.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
