[
https://issues.apache.org/jira/browse/HADOOP-17728?focusedWorklogId=601522&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-601522
]
ASF GitHub Bot logged work on HADOOP-17728:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 25/May/21 04:52
Start Date: 25/May/21 04:52
Worklog Time Spent: 10m
Work Description: liuml07 edited a comment on pull request #3042:
URL: https://github.com/apache/hadoop/pull/3042#issuecomment-847527081
Thanks for including me, @steveloughran
Let me first understand the problem: unless a new reference object becomes
available in the queue (Java code calling `enqueue()`), the existing
references will never be cleaned up. That is because `remove()` makes the
`StatisticsDataReferenceCleaner` thread wait forever when there are no
notify/notifyAll events on the internal queue lock.
To fix the problem, the proposal here is to call the `remove(timeout)` variant
in the `StatisticsDataReferenceCleaner` thread. The timeout is honored while
waiting on the internal queue lock, which gives the cleaner thread an
opportunity to dequeue periodically instead of blocking forever when no
notify event happens on the internal queue lock. Eventually, all reference
objects in the queue will be cleaned up by the cleaner with this mechanism.
That makes sense to me, if I understand the problem and solution correctly.
Let me know @yikf
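To make the proposal concrete, here is a minimal sketch of a cleaner loop built on `ReferenceQueue.remove(long)`, which returns `null` when the timeout expires. The class name, the helper names, and the 100ms value are hypothetical and are not taken from the Hadoop code or from the pull request.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Hypothetical sketch; not the actual StatisticsDataReferenceCleaner.
final class TimedCleanerSketch {
    // Assumed value for illustration only. Must be > 0, because
    // remove(0) waits indefinitely, just like remove().
    static final long TIMEOUT_MS = 100L;

    /** Waits up to timeoutMs for a reference; returns null on timeout or interrupt. */
    static Reference<?> removeOrNull(ReferenceQueue<Object> queue, long timeoutMs) {
        try {
            return queue.remove(timeoutMs);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the caller can stop its loop.
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Runs at most maxPasses timed removals and returns how many references were cleaned. */
    static int drain(ReferenceQueue<Object> queue, long timeoutMs, int maxPasses) {
        int cleaned = 0;
        for (int i = 0; i < maxPasses && !Thread.currentThread().isInterrupted(); i++) {
            Reference<?> ref = removeOrNull(queue, timeoutMs);
            if (ref != null) {
                ref.clear(); // stand-in for the real per-reference cleanup
                cleaned++;
            }
            // On a null result the loop simply tries again after the timeout.
        }
        return cleaned;
    }

    public static void main(String[] args) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object referent = new Object();
        WeakReference<Object> ref = new WeakReference<>(referent, queue);
        ref.enqueue(); // enqueue by hand so the demo does not depend on GC timing
        System.out.println("cleaned = " + drain(queue, TIMEOUT_MS, 3));
    }
}
```

The bounded `maxPasses` loop is only there to keep the demo finite; a real daemon thread would presumably loop until interrupted.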
As for the implementation, I agree 100s might be too stingy for this cleanup (we
remove one each time, so essentially 100s to clean up one at best). I'm also
wondering if 100ms is too generous here. How many threads are we targeting? To
the best of my knowledge, 1K is pretty large and close to the upper limit. To
clean up everything eventually AND without any help from enqueue events, it
takes 10min if the timeout is 600ms. Is this a reasonable value?
I see you refer to Spark settings, but I assume those target many more
references, including RDD, shuffle, and broadcast state, etc.?
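The timeout arithmetic above can be checked with a small sketch. It assumes the pessimistic model used in this comment, where each removal waits for one full timeout and nothing is helped along by enqueue events; the class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper for the back-of-the-envelope estimate.
final class DrainTimeEstimate {
    /** Worst case under the stated model: one full timeout per reference removed. */
    static Duration worstCaseDrain(int references, Duration timeoutPerRemoval) {
        return timeoutPerRemoval.multipliedBy(references);
    }

    public static void main(String[] args) {
        int references = 1000; // the "1K" upper bound mentioned above
        for (Duration timeout : new Duration[] {
                Duration.ofMillis(100), Duration.ofMillis(600), Duration.ofSeconds(100)}) {
            Duration total = worstCaseDrain(references, timeout);
            System.out.println(timeout + " per removal -> " + total.getSeconds() + "s in total");
        }
    }
}
```

Under that model, 1K references with a 600ms timeout come to 600s, which matches the 10min figure.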
Issue Time Tracking
-------------------
Worklog Id: (was: 601522)
Time Spent: 2.5h (was: 2h 20m)
> Deadlock in FileSystem StatisticsDataReferenceCleaner cleanUp
> -------------------------------------------------------------
>
> Key: HADOOP-17728
> URL: https://issues.apache.org/jira/browse/HADOOP-17728
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 3.2.1
> Reporter: yikf
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> The cleaner thread will block when removing a reference from the
> ReferenceQueue unless `queue.enqueue` is called.
> ----
> As shown below, cleanUp currently calls ReferenceQueue.remove(). The call
> chain is as follows:
> *StatisticsDataReferenceCleaner#queue.remove() ->
> ReferenceQueue.remove(0) -> lock.wait(0)*
> However, lock.notifyAll is called only from queue.enqueue, so the cleaner
> thread will be blocked.
>
> ThreadDump:
> {code:java}
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f7afc088800 nid=0x2119 in Object.wait() [0x00007f7b00230000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x00000000c00c2f58> (a java.lang.ref.Reference$Lock)
> at java.lang.Object.wait(Object.java:502)
> at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
> - locked <0x00000000c00c2f58> (a java.lang.ref.Reference$Lock)
> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
> {code}