[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-09-18 Thread Sangjin Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417206#comment-17417206
 ] 

Sangjin Lee commented on HADOOP-17728:
--

Rather than discussing what to change, I would respectfully ask the first 
question that needs to be answered.

*What is a real problem/issue that needs to be fixed?* What is the "anomaly" 
that needs to be fixed? IMO, I haven't heard a real problem yet.

So that we are on the same page, let me state what I believe is *NOT* a 
problem. The cleaner thread being in the blocked state is *not* a problem.
 * the cleaner thread will be blocked until a GC reference is enqueued, and 
that is by design; that's how the reference queue works
 * the cleaner thread will wake up whenever the data references get garbage 
collected; enqueueing is done by the JVM, so the fact that there is no Hadoop 
code that enqueues on the queue is not a problem
 * it is *NOT* true that the cleaner thread does not respond to interruption; 
please see the code for StatisticsDataReferenceCleaner.run()

I'd like to know if there is any *real-world* problem associated with the 
cleaner thread and its operations. For example, does it cause unexpected 
exceptions? Does it cause unexpected CPU spikes? Does it cause unexpected 
memory increase? Does it prevent the JVM from exiting when it needs to exit? 
Does it cause a deadlock? Let's understand first if there is a real-world 
problem. And let's show evidence for that real-world problem. I hope it makes 
sense. Thanks.

> Fix issue of the StatisticsDataReferenceCleaner cleanUp
> ---
>
> Key: HADOOP-17728
> URL: https://issues.apache.org/jira/browse/HADOOP-17728
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 3.2.1
> Reporter: yikf
> Assignee: yikf
> Priority: Minor
> Labels: pull-request-available, reverted
> Time Spent: 5h 10m
> Remaining Estimate: 0h
>
> The cleaner thread blocks when it removes a reference from the 
> ReferenceQueue, unless `queue.enqueue` is called.
> 
> As shown below, cleanUp currently calls ReferenceQueue.remove(). The call 
> chain is as follows:
> *StatisticsDataReferenceCleaner#queue.remove() -> ReferenceQueue.remove(0) 
> -> lock.wait(0)*
> However, lock.notifyAll is called only from queue.enqueue, so the cleaner 
> thread stays blocked.
>  
> ThreadDump:
> {code:java}
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x7f7afc088800 nid=0x2119 in Object.wait() [0x7f7b0023]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0xc00c2f58> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
>         - locked <0xc00c2f58> (a java.lang.ref.Reference$Lock)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-09-17 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417004#comment-17417004
 ] 

JiangHua Zhu commented on HADOOP-17728:
---

[~sjlee0], your points are well taken.
Still, if the occurrence of this anomaly can be reduced, that should be 
beneficial.
I therefore propose the following steps:
1. When StatisticsDataReferenceCleaner#run executes, check whether allData 
contains any elements. If it does, call STATS_DATA_REF_QUEUE.remove(). If it 
does not, sleep for 30 ms or 50 ms.
2. When calling STATS_DATA_REF_QUEUE.remove(), pass a timeout.

If my suggestion is inappropriate, please ignore it.
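
Step 2 of the proposal above could look roughly like the following. This is a minimal sketch, not the Hadoop code or any submitted patch: the class name TimedCleanerSketch, the method pollOnce and the constant POLL_TIMEOUT_MS are hypothetical names chosen for illustration. It relies only on ReferenceQueue.remove(long), which returns null when the timeout elapses without a reference being enqueued.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Illustrative only: every name in this class is hypothetical, not part of Hadoop.
final class TimedCleanerSketch {
  // Assumed poll interval; the comment above suggests something like 30 ms or 50 ms.
  static final long POLL_TIMEOUT_MS = 50L;

  /**
   * Waits up to timeoutMs (must be positive; 0 would wait indefinitely) for a
   * reference. Returns null if nothing was enqueued in time or if the calling
   * thread was interrupted.
   */
  static Reference<?> pollOnce(ReferenceQueue<Object> queue, long timeoutMs) {
    try {
      return queue.remove(timeoutMs);
    } catch (InterruptedException e) {
      // Preserve the interrupt status for the caller's loop condition.
      Thread.currentThread().interrupt();
      return null;
    }
  }

  public static void main(String[] args) {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    long start = System.nanoTime();
    Reference<?> ref = pollOnce(queue, POLL_TIMEOUT_MS);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
    System.out.println("returned " + ref + " after about " + elapsedMs + " ms");
  }
}
```

With a timeout, the thread wakes up periodically even when there is nothing to clean, so the change trades an idle blocked thread for regular wakeups.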




[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-09-17 Thread Sangjin Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416774#comment-17416774
 ] 

Sangjin Lee commented on HADOOP-17728:
--

It's still not clear to me what *real* problem is being pointed out. The fact 
that you see the stack trace with that thread in the waiting state is *not* a 
problem. In fact, I would expect to see that in almost all cases as this 
background thread gets busy only when the statistics data gets garbage 
collected. That is perfectly normal. So, if we're saying the presence of this 
stack trace is a problem, I can say it definitely is not.

On an earlier claim that this is not interruptible, it is most definitely 
interruptible. ReferenceQueue.remove() is interruptible and throws 
InterruptedException on interruption. The outer loop for 
StatisticsDataReferenceCleaner.run() clearly checks if the thread was 
interrupted and if so exits from the while loop. Please check it out.
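
The interruption behaviour described above can be checked with a small standalone program. This is a simplified stand-in modelled on the loop shape described in this comment, not the actual FileSystem.java source; the class name InterruptibleCleanerSketch and the method exitsOnInterrupt are hypothetical.

```java
import java.lang.ref.ReferenceQueue;

// Illustrative only: a stand-in for a cleaner-style loop, not Hadoop code.
final class InterruptibleCleanerSketch {

  /** Starts a cleaner-style thread, interrupts it while idle, and reports whether it exited. */
  static boolean exitsOnInterrupt() {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    Thread cleaner = new Thread(() -> {
      while (!Thread.interrupted()) {
        try {
          queue.remove(); // blocks until a reference is enqueued or the thread is interrupted
        } catch (InterruptedException e) {
          // Restore the flag so the while condition observes it and the loop ends.
          Thread.currentThread().interrupt();
        }
      }
    }, "cleaner-sketch");
    cleaner.setDaemon(true);
    cleaner.start();
    try {
      // Give the thread up to five seconds to park inside remove().
      long deadline = System.nanoTime() + 5_000_000_000L;
      while (cleaner.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
        Thread.sleep(10);
      }
      cleaner.interrupt();
      cleaner.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !cleaner.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("cleaner exited after interrupt: " + exitsOnInterrupt());
  }
}
```

Nothing is ever enqueued in this sketch, so the thread sits in the WAITING state until the interrupt arrives, which is the same state the thread dumps in this issue show.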

Lastly, this is a background thread whose sole job is to clean up references 
upon garbage collection. This has no other interaction with any other thread or 
operations that may be going on. I'm not sure why that is being discussed as a 
problem.

Unless there is a clear demonstration of a real-life issue (not the stack 
trace), I am inclined to close this as a "not an issue".




[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-09-17 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416637#comment-17416637
 ] 

JiangHua Zhu commented on HADOOP-17728:
---

We encountered a similar problem, for example:
"org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner" #29 daemon prio=5 os_prio=0 tid=0x7fe1d55db000 nid=0x3f96 in Object.wait() [0x7fe1a3cb6000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00072661f828> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
        - locked <0x00072661f828> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
        at org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3701)
        at java.lang.Thread.run(Thread.java:748)
I don't think this happens by accident.




[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-06-13 Thread Sangjin Lee (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362577#comment-17362577
 ] 

Sangjin Lee commented on HADOOP-17728:
--

This caught my attention (sorry I'm not very active in the Hadoop codebase 
lately).

I'm not quite sure what the problem statement here is. Is there a real problem 
that can be demonstrated with a reproducible test case? The reference queue 
gets enqueued not by user's explicit code but by the JVM via weak references in 
this case. The GC will enqueue the reference that's being garbage collected 
into the reference queue. That's why there is no code in the Hadoop codebase 
that enqueues objects explicitly to this queue.
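
The JVM-driven enqueueing described above can be observed in isolation. The sketch below is a generic illustration rather than Hadoop code, and the names GcEnqueueSketch and gcEnqueuesWeakReference are hypothetical. Note that System.gc() is only a request, so the program retries a bounded number of times and the outcome can depend on the JVM and its flags.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Illustrative only: shows that the garbage collector, not application code,
// places a cleared weak reference on its queue.
final class GcEnqueueSketch {

  /** Returns true if the GC enqueued the weak reference within the retry budget. */
  static boolean gcEnqueuesWeakReference() {
    ReferenceQueue<Object> queue = new ReferenceQueue<>();
    Object referent = new Object();
    WeakReference<Object> weak = new WeakReference<>(referent, queue);
    referent = null; // drop the only strong reference to the referent

    try {
      for (int attempt = 0; attempt < 50; attempt++) {
        System.gc(); // a hint to the JVM, not a guarantee
        // No call to enqueue() appears anywhere in this program.
        Reference<?> enqueued = queue.remove(100);
        if (enqueued == weak) {
          return true;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println("weak reference enqueued by GC: " + gcEnqueuesWeakReference());
  }
}
```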

The cleaner thread is essentially a daemon thread that needs to run for the 
duration of the runtime to handle this task. If there is no work to be done (no 
relevant threads to garbage collect), then it will sit idle on the queue, which 
is fine. If the program needs to exit and there is an interrupt, the cleaner 
thread *does* respond to the interrupt and does an orderly exit (see the while 
loop condition). So I'm still wondering what real-world problems we're 
observing.

It might be helpful to revisit HADOOP-12107 and HADOOP-12958 for the past 
analyses that went into this.




[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-06-11 Thread yikf (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362205#comment-17362205
 ] 

yikf commented on HADOOP-17728:
---

[~liuml07] [~Jim_Brennan]

I think the issue is that remove() blocks indefinitely, so the thread never 
gets to check Thread.interrupted() again if nothing is in the queue.

[https://github.com/apache/hadoop/blob/352949d07002a8435a8ff67eecf88e4aa8bd5935/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L4009]

Would it be better to have a timeout?




[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-06-11 Thread Mingliang Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362036#comment-17362036
 ] 

Mingliang Liu commented on HADOOP-17728:


Thanks [~Jim_Brennan]! I would like to keep it open for a while in case there 
are more comments. Apparently, when this patch was discussed in the PR, it was 
considered valid. I will follow up on the discussion in [HADOOP-17758].

When reporting a bug, if you find the JIRA related to the cause, you can also 
comment directly on the original JIRA (e.g. this one) instead of opening a new 
JIRA (e.g. HADOOP-17758). That way we can track the context in one place. But 
if it is not clear which patch is in question, opening a new one is better.






[jira] [Commented] (HADOOP-17728) Fix issue of the StatisticsDataReferenceCleaner cleanUp

2021-06-11 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361795#comment-17361795
 ] 

Jim Brennan commented on HADOOP-17728:
--

[~liuml07] should we close this as invalid?
