[ https://issues.apache.org/jira/browse/HDFS-14575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077886#comment-17077886 ]
Tao Yang commented on HDFS-14575:
---------------------------------
Thanks [~weichiu] for the review and suggestion.
You are right: we can prevent that failure if each lease renewer has only one
daemon thread, and I think the approach you gave can work fine for this issue.
I'm planning to add a stopping state to LeaseRenewer. It can be set when the
LeaseRenewer is exiting (in the synchronized block of LeaseRenewer#run) and
checked at the beginning of LeaseRenewer#put (a synchronized method). If
stopping is true, put throws an exception, which is then handled in
DFSClient#beginFileLease as you suggested. This way we can guarantee that the
lease renewer to which a client has been added won't exit unexpectedly (since
emptyTime is updated to Long.MAX_VALUE at the end of LeaseRenewer#put). I'd
like to hear your thoughts.
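A minimal sketch of the stopping-state idea, assuming a simplified renewer guarded by a single lock. StoppableRenewer, checkExpiry, BeginFileLeaseSketch and the retry loop are hypothetical names and logic, not Hadoop APIs, and IllegalStateException only stands in for whatever exception a real patch would throw. The point it illustrates is that the exit decision and put() take the same monitor, so put() either runs first (resetting emptyTime, so the renewer cannot expire while it has clients) or sees stopping and makes the caller fetch another renewer.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Toy model of a lease renewer with a "stopping" state (hypothetical, not Hadoop code). */
class StoppableRenewer {
  private final long gracePeriodMs;
  private final Set<String> clients = new HashSet<>();
  private long emptyTime = Long.MAX_VALUE; // MAX_VALUE means "expiry timer not running"
  private boolean stopping = false;

  StoppableRenewer(long gracePeriodMs) {
    this.gracePeriodMs = gracePeriodMs;
  }

  /** Proposed check at the top of put(): refuse clients once the renewer is exiting. */
  synchronized void put(String client) {
    if (stopping) {
      throw new IllegalStateException("renewer is stopping");
    }
    clients.add(client);
    emptyTime = Long.MAX_VALUE; // reset the timer; it only restarts when clients is empty
  }

  synchronized void remove(String client) {
    clients.remove(client);
  }

  /**
   * One pass of the synchronized block in the daemon's run loop. Returns true
   * when the daemon should exit. Because stopping is set under the same lock
   * as put(), any put() that runs after this decision sees stopping == true.
   */
  synchronized boolean checkExpiry(long nowMs) {
    if (clients.isEmpty() && emptyTime == Long.MAX_VALUE) {
      emptyTime = nowMs; // start the grace-period timer
    }
    if (emptyTime != Long.MAX_VALUE && nowMs - emptyTime > gracePeriodMs) {
      stopping = true;
    }
    return stopping;
  }

  synchronized boolean isStopping() {
    return stopping;
  }

  synchronized boolean hasClient(String client) {
    return clients.contains(client);
  }
}

class BeginFileLeaseSketch {
  /** Hypothetical caller-side handling: retry with whatever renewer the factory returns next. */
  static StoppableRenewer beginFileLease(String client, Supplier<StoppableRenewer> factory) {
    while (true) {
      StoppableRenewer renewer = factory.get();
      try {
        renewer.put(client);
        return renewer;
      } catch (IllegalStateException ignored) {
        // The renewer is exiting; ask the factory again. This assumes the
        // factory eventually stops handing out the stopping renewer.
      }
    }
  }
}
```

In a real client the retry would also need the exiting renewer to be removed from the factory, otherwise the loop could keep receiving the same stopping instance.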
The unit test can simulate the race condition, but it is hard to verify the
fix with it. Any suggestions about this?
> LeaseRenewer#daemon threads leak in DFSClient
> ---------------------------------------------
>
> Key: HDFS-14575
> URL: https://issues.apache.org/jira/browse/HDFS-14575
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.1.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Attachments: HDFS-14575.001.patch, HDFS-14575.002.patch
>
>
> Currently a LeaseRenewer (and its daemon thread) without clients should be
> terminated after a grace period, which defaults to 60 seconds. A race
> condition may happen when a new request arrives just after the LeaseRenewer
> has expired.
> Steps to reproduce this race condition:
> # Client#1 creates File#1: this creates LeaseRenewer#1 and starts the
> Daemon#1 thread. After a few seconds File#1 is closed, so there are no
> clients in LeaseRenewer#1.
> # 60 seconds (the grace period) later, LeaseRenewer#1 has just expired but
> the Daemon#1 thread is still sleeping. Client#1 creates File#2, which leads
> to the creation of Daemon#2.
> # Daemon#1 wakes up and exits; after that, LeaseRenewer#1 is removed from
> the factory.
> # File#2 is closed after a few seconds; LeaseRenewer#2 is created because
> LeaseRenewer#1 can no longer be obtained from the factory.
> The Daemon#2 thread leaks from then on, since Client#1 can never be removed
> from LeaseRenewer#1, so Daemon#2 never gets a chance to stop.
> To solve this problem, IIUC, a simple way is to make sure that all
> clients are cleared when a LeaseRenewer is removed from the factory. Please
> feel free to give your suggestions. Thanks!
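The four reproduction steps in the description can be replayed deterministically in a small single-threaded model. This is a sketch under stated assumptions, not Hadoop's LeaseRenewer: RaceModel, Renewer, daemonWakes, replay and the "user" factory key are hypothetical; each daemon is reduced to an integer id, and each wake-up of a daemon's run loop is an explicit daemonWakes call with a simulated clock. The clearClientsOnRemove flag switches on the approach proposed in the description (clear all clients when the renewer is removed from the factory).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical single-threaded model of the HDFS-14575 race (not Hadoop code). */
class RaceModel {
  static final long GRACE_MS = 60_000;

  static class Renewer {
    final Set<String> clients = new HashSet<>();
    final Set<Integer> liveDaemons = new HashSet<>();
    long emptyTime = Long.MAX_VALUE; // MAX_VALUE means "expiry timer not running"
    int currentId = 0;

    boolean expired(long now) {
      return emptyTime != Long.MAX_VALUE && now - emptyTime > GRACE_MS;
    }

    /** Starts a new daemon if none is live or the renewer has expired. */
    void put(String client, long now) {
      if (liveDaemons.isEmpty() || expired(now)) {
        liveDaemons.add(++currentId);
      }
      clients.add(client);
      emptyTime = Long.MAX_VALUE;
    }

    void removeClient(String client) {
      clients.remove(client);
    }
  }

  final Map<String, Renewer> factory = new HashMap<>();
  final boolean clearClientsOnRemove; // the fix proposed in the description

  RaceModel(boolean clearClientsOnRemove) {
    this.clearClientsOnRemove = clearClientsOnRemove;
  }

  Renewer getRenewer(String key) {
    return factory.computeIfAbsent(key, k -> new Renewer());
  }

  /** One wake-up of daemon `id`: exit if superseded or expired, else maybe start the timer. */
  void daemonWakes(String key, Renewer r, int id, long now) {
    if (id != r.currentId || r.expired(now)) {
      r.liveDaemons.remove(id); // the daemon exits
      if (factory.get(key) == r) { // only remove the renewer the factory still holds
        if (clearClientsOnRemove) {
          r.clients.clear();
        }
        factory.remove(key);
      }
    } else if (r.clients.isEmpty() && r.emptyTime == Long.MAX_VALUE) {
      r.emptyTime = now; // start the grace-period timer
    }
  }

  /** Replays steps 1-4 and returns LeaseRenewer#1. */
  Renewer replay() {
    Renewer r1 = getRenewer("user");
    r1.put("client1", 0); // 1. File#1 created: Daemon#1 starts
    getRenewer("user").removeClient("client1"); //    File#1 closed
    daemonWakes("user", r1, 1, 1_000); //    Daemon#1 starts the timer
    long t = 1_000 + GRACE_MS + 1; // 2. LeaseRenewer#1 has just expired
    getRenewer("user").put("client1", t); //    File#2 created: Daemon#2 starts
    daemonWakes("user", r1, 1, t); // 3. Daemon#1 exits, r1 leaves the factory
    getRenewer("user").removeClient("client1"); // 4. File#2 closed on a new renewer
    return r1;
  }
}
```

With the flag off, Daemon#2 stays live on a renewer the factory no longer knows about; with it on, Daemon#2 finds the renewer empty, starts the grace-period timer and exits. In this model the clear also drops Client#1 from the only live renewer while File#2 is still open (between steps 3 and 4), so nothing renews that lease during the window.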
--
This message was sent by Atlassian Jira
(v8.3.4#803005)