[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417206#comment-17417206 ]

Sangjin Lee commented on HADOOP-17728:
--------------------------------------

Rather than discussing what to change, I would respectfully ask the first question that needs to be answered. *What is a real problem/issue that needs to be fixed?* What is the "anomaly" that needs to be fixed? IMO, I haven't heard a real problem yet.

So that we are on the same page, let me state what I believe is *NOT* a problem. The cleaner thread being in the blocked state is *not* a problem.
* the cleaner thread will be blocked until a GC reference is enqueued, and that is by design; that's how the reference queue works
* the cleaner thread will wake up whenever the data references get garbage collected; enqueueing is done by the JVM, so the fact that there is no Hadoop code that enqueues on the queue is not a problem
* it is *NOT* true that the cleaner thread does not respond to interruption; please see the code for StatisticsDataReferenceCleaner.run()

I'd like to know if there is any *real-world* problem associated with the cleaner thread and its operations. For example, does it cause unexpected exceptions? Does it cause unexpected CPU spikes? Does it cause unexpected memory increase? Does it prevent the JVM from exiting when it needs to exit? Does it cause a deadlock?

Let's understand first if there is a real-world problem. And let's show evidence for that real-world problem. I hope it makes sense. Thanks.
> Fix issue of the StatisticsDataReferenceCleaner cleanUp
> -------------------------------------------------------
>
>                 Key: HADOOP-17728
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17728
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.2.1
>            Reporter: yikf
>            Assignee: yikf
>            Priority: Minor
>              Labels: pull-request-available, reverted
>          Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Cleaner thread will be blocked if we remove reference from ReferenceQueue
> unless the `queue.enqueue` called.
>
> As shown below, We call ReferenceQueue.remove() now while cleanUp, Call
> chain as follow:
> *StatisticsDataReferenceCleaner#queue.remove() ->
> ReferenceQueue.remove(0) -> lock.wait(0)*
> But, lock.notifyAll is called when queue.enqueue only, so Cleaner thread
> will be blocked.
>
> ThreadDump:
> {code:java}
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x7f7afc088800 nid=0x2119 in Object.wait() [0x7f7b0023]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0xc00c2f58> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
>         - locked <0xc00c2f58> (a java.lang.ref.Reference$Lock)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417004#comment-17417004 ]

JiangHua Zhu commented on HADOOP-17728:
---------------------------------------

[~sjlee0], what you mentioned is very meaningful.
If the occurrence of this anomaly can be reduced, it should be very beneficial. Therefore, I propose a solution here. The steps are as follows:
1. When StatisticsDataReferenceCleaner#run is executed, check whether allData contains any elements. If it does, call STATS_DATA_REF_QUEUE.remove(). If it does not, sleep for 30ms or 50ms;
2. When calling STATS_DATA_REF_QUEUE.remove(), pass a timeout.

If my suggestion is inappropriate, please ignore it.
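The timeout proposed in step 2 can be sketched roughly as follows. This is a minimal illustration and not the Hadoop code: the class TimedRemoveSketch, the method countIdleWakeups, and the local queue are hypothetical names. It relies only on ReferenceQueue.remove(long), which returns null when the timeout elapses with nothing enqueued.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Hypothetical sketch; names are illustrative and not taken from FileSystem.java.
final class TimedRemoveSketch {

  /**
   * Polls an empty queue up to the given number of times with a timed remove
   * and returns how many polls timed out (returned null) instead of
   * delivering a reference.
   */
  static int countIdleWakeups(long timeoutMillis, int polls) {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    int idleWakeups = 0;
    for (int i = 0; i < polls && !Thread.currentThread().isInterrupted(); i++) {
      try {
        // A timeout of 0 would wait indefinitely; a positive value bounds the wait.
        Reference<?> ref = queue.remove(timeoutMillis);
        if (ref == null) {
          idleWakeups++; // timed out with nothing to clean
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the flag and stop polling
      }
    }
    return idleWakeups;
  }

  public static void main(String[] args) {
    System.out.println("idle wakeups: " + countIdleWakeups(50, 3));
  }
}
```

With a timeout, the thread wakes up on every interval even when there is nothing to clean, which is the cost to weigh against the untimed remove().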
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416774#comment-17416774 ]

Sangjin Lee commented on HADOOP-17728:
--------------------------------------

It's still not clear to me what *real* problem is being pointed out.

The fact that you see the stack trace with that thread in the waiting state is *not* a problem. In fact, I would expect to see that in almost all cases as this background thread gets busy only when the statistics data gets garbage collected. That is perfectly normal. So, if we're saying the presence of this stack trace is a problem, I can say it definitely is not.

On an earlier claim that this is not interruptible, it is most definitely interruptible. ReferenceQueue.remove() is interruptible and throws InterruptedException on interruption. The outer loop for StatisticsDataReferenceCleaner.run() clearly checks if the thread was interrupted and if so exits from the while loop. Please check it out.

Lastly, this is a background thread whose sole job is to clean up references upon garbage collection. This has no other interaction with any other thread or operations that may be going on. I'm not sure why that is being discussed as a problem.

Unless there is a clear demonstration of a real-life issue (not the stack trace), I am inclined to close this as a "not an issue".
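The interruptibility point can be checked with a small stand-alone program. The sketch below is not the StatisticsDataReferenceCleaner source; the class and method names are hypothetical, and the loop only mirrors the shape described in the comment (an interrupt check around a blocking ReferenceQueue.remove()).

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Hypothetical stand-in for a cleaner loop; not copied from Hadoop.
final class InterruptibleCleanerSketch {

  /**
   * Starts a daemon thread that blocks in remove() on a queue that never
   * receives anything, interrupts it, and reports whether the thread exited
   * within the given join timeout.
   */
  static boolean exitsOnInterrupt(long joinTimeoutMillis) {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    Thread cleaner = new Thread(() -> {
      while (!Thread.interrupted()) {
        try {
          // Blocks until a reference is enqueued or the thread is interrupted.
          Reference<?> ref = queue.remove();
          // A real cleaner would release the state associated with ref here.
        } catch (InterruptedException e) {
          // Restore the flag so the loop condition observes it and exits.
          Thread.currentThread().interrupt();
        }
      }
    }, "cleaner-sketch");
    cleaner.setDaemon(true);
    cleaner.start();
    try {
      Thread.sleep(100); // let the thread reach remove(); not needed for correctness
      cleaner.interrupt();
      cleaner.join(joinTimeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !cleaner.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("exited after interrupt: " + exitsOnInterrupt(5000));
  }
}
```

The thread exits whether the interrupt arrives before or during the wait: in the first case the loop condition or remove() sees the pending flag, and in the second remove() throws InterruptedException.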
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416637#comment-17416637 ]

JiangHua Zhu commented on HADOOP-17728:
---------------------------------------

We encountered similar problems, such as:

"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #29 daemon prio=5 os_prio=0 tid=0x7fe1d55db000 nid=0x3f96 in Object.wait() [0x7fe1a3cb6000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00072661f828> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00072661f828> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3701)
        at java.lang.Thread.run(Thread.java:748)

I think this problem may not happen by accident.
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362577#comment-17362577 ]

Sangjin Lee commented on HADOOP-17728:
--------------------------------------

This caught my attention (sorry I'm not very active in the Hadoop codebase lately).

I'm not quite sure what the problem statement here is. Is there a real problem that can be demonstrated with a reproducible test case?

The reference queue gets enqueued not by the user's explicit code but by the JVM via weak references in this case. The GC will enqueue the reference that's being garbage collected into the reference queue. That's why there is no code in the Hadoop codebase that enqueues objects explicitly to this queue.

The cleaner thread is essentially a daemon thread that needs to run for the duration of the runtime to handle this task. If there is no work to be done (no relevant threads to garbage collect), then it will sit idle on the queue, which is fine. If the program needs to exit and there is an interrupt, the cleaner thread *does* respond to the interrupt and does an orderly exit (see the while loop condition).

So I'm still wondering what real-world problems we're observing. It might be helpful to jog your memory on HADOOP-12107 and HADOOP-12958 for past analyses that went into this.
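The point that the JVM, not application code, enqueues the reference can be illustrated with a plain WeakReference. This is a sketch under assumptions: the class and method names are hypothetical, and System.gc() is only a hint to the JVM, so the method retries a bounded number of times and may still report false on a runtime that ignores explicit GC requests.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Hypothetical sketch; not taken from the Hadoop codebase.
final class GcEnqueueSketch {

  /**
   * Registers a weak reference to an otherwise unreachable object and waits
   * for the JVM to enqueue it. No code here calls enqueue() on the reference.
   */
  static boolean gcEnqueuesWeakReference(int maxAttempts) {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // The referent has no strong reference once the constructor returns.
    WeakReference<Object> weak = new WeakReference<>(new Object(), queue);
    try {
      for (int i = 0; i < maxAttempts; i++) {
        System.gc(); // a request only; the JVM decides whether to collect
        Reference<?> ref = queue.remove(100); // bounded wait for the enqueue
        if (ref != null) {
          return ref == weak; // the reference the GC cleared is the one delivered
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println("enqueued by JVM: " + gcEnqueuesWeakReference(50));
  }
}
```

A thread blocked in remove() on such a queue is woken by this JVM-side enqueue, which is why a cleaner waiting there is idle rather than stuck.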
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362205#comment-17362205 ]

yikf commented on HADOOP-17728:
-------------------------------

[~liuml07] [~Jim_Brennan]

I think the issue is that remove() blocks indefinitely, so the thread would never check Thread.interrupted() again if nothing is in the queue.

[https://github.com/apache/hadoop/blob/352949d07002a8435a8ff67eecf88e4aa8bd5935/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L4009]

Would it be better to have a timeout?
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362036#comment-17362036 ]

Mingliang Liu commented on HADOOP-17728:
----------------------------------------

Thanks [~Jim_Brennan]! I would like to keep it open for a while in case there are more comments. Apparently when this patch was discussed in the PR, it was considered valid.

I will follow up on the discussion in [HADOOP-17758]. When reporting a bug, if you find the JIRA related to the cause, you can also comment directly on the original JIRA (e.g. this one) instead of opening a new JIRA (e.g. HADOOP-17758). That way we can track the context in one place. But if it is not clear which patch is in question, opening a new one will be better.
[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp
[ https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361795#comment-17361795 ]

Jim Brennan commented on HADOOP-17728:
--------------------------------------

[~liuml07] should we close this as invalid?