[ 
https://issues.apache.org/jira/browse/HDFS-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554629#comment-13554629
 ] 

liang xie commented on HDFS-4359:
---------------------------------

Yes, the root cause is on the namenode side; this issue only concerns the 
datanode. Per the thread dump and the source code, we can safely remove the 
"synchronized" keyword, though doing so would not have prevented the overall hang :)


bq. the thread holding the lock is stuck in a 'versionRequest()' RPC. Any idea 
why this RPC is taking a long time hearing back from the NN?
Yes, we figured it out several days ago: one of the DNS servers had failed. 
The thread dump is really interesting, though, so I've uploaded the NN 
thread dump for you to enjoy :)  By the way, with a JUC (java.util.concurrent) 
lock it is not easy to find the lock holder, which made the analysis difficult for us...
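
As an aside on finding the holder of a java.util.concurrent lock: the sketch below is only an illustration, not something taken from the attached dumps. It uses the standard ThreadMXBean API to list threads that currently own an "ownable synchronizer" (for example a ReentrantLock), which is the same kind of information "jstack -l" reports under "Locked ownable synchronizers". The class name, method name and thread name are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockOwnerSketch {
    /** Names of live threads that own at least one ownable synchronizer. */
    public static List<String> threadsHoldingJucLocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> owners = new ArrayList<String>();
        if (!mx.isSynchronizerUsageSupported()) {
            return owners; // this JVM cannot report ownable synchronizers
        }
        // lockedMonitors=false, lockedSynchronizers=true
        for (ThreadInfo info : mx.dumpAllThreads(false, true)) {
            if (info != null && info.getLockedSynchronizers().length > 0) {
                owners.add(info.getThreadName());
            }
        }
        return owners;
    }

    public static void main(String[] args) throws InterruptedException {
        final ReentrantLock lock = new ReentrantLock();
        final CountDownLatch locked = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        // Hypothetical thread that grabs the lock and then stalls.
        Thread holder = new Thread(new Runnable() {
            public void run() {
                lock.lock();
                try {
                    locked.countDown();
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.unlock();
                }
            }
        }, "hypothetical-lock-holder");
        holder.start();
        locked.await();
        System.out.println(threadsHoldingJucLocks());
        release.countDown();
        holder.join();
    }
}
```

This only works from inside the JVM (or over JMX); for a dump taken after the fact, the lock-holder information has to be in the dump already.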
                
> remove an unnecessary synchronized keyword in BPOfferService.java
> -----------------------------------------------------------------
>
>                 Key: HDFS-4359
>                 URL: https://issues.apache.org/jira/browse/HDFS-4359
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.0.2-alpha
>            Reporter: liang xie
>            Assignee: liang xie
>         Attachments: dn.jstack, HDFS-4359.txt, nn_dns_broken.jstack
>
>
> We encountered an NN and DN hang; the DN hang was caused by the NN not 
> responding to heartbeats. Based on the DN thread dump, I think we can make a 
> small improvement to this piece of code:
>   synchronized List<BPServiceActor> getBPServiceActors() {
>     return Lists.newArrayList(bpServices);
>   }
> bpServices is declared as:
>   private List<BPServiceActor> bpServices =
>     new CopyOnWriteArrayList<BPServiceActor>();
> It is indeed a thread-safe variant, so we can safely remove the synchronized 
> keyword above, IMHO.
> Here is a simple statistic from the thread dump: 
> xieliang@xieliang:/tmp$ grep 0x00000007b00289f0 dn.jstack |wc -l
> 252
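
To illustrate the reasoning in the description above, here is a minimal, self-contained sketch. It is not the actual BPOfferService code: String stands in for BPServiceActor, plain ArrayList replaces Guava's Lists.newArrayList, and the class and method names are hypothetical. It shows a reader copying a CopyOnWriteArrayList without any "synchronized" keyword while another thread keeps appending to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class SnapshotSketch {
    // Stand-in for bpServices; String replaces BPServiceActor (simplification).
    private final List<String> actors = new CopyOnWriteArrayList<String>();

    public void add(String actor) {
        actors.add(actor);
    }

    // No "synchronized" here: the ArrayList copy constructor calls toArray() on
    // the source, and CopyOnWriteArrayList.toArray() copies a single snapshot
    // array that writers never modify in place.
    public List<String> getActors() {
        return new ArrayList<String>(actors);
    }

    public static void main(String[] args) throws InterruptedException {
        final SnapshotSketch sketch = new SnapshotSketch();
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++) {
                    sketch.add("actor-" + i);
                }
            }
        });
        writer.start();
        int last = 0;
        while (writer.isAlive()) {
            // Copying concurrently with the writer; each copy is a consistent snapshot.
            int size = sketch.getActors().size();
            if (size < last) {
                throw new AssertionError("snapshot went backwards");
            }
            last = size;
        }
        writer.join();
        System.out.println("final size: " + sketch.getActors().size());
    }
}
```

The sketch only covers this one read path. Whether other methods in the real class rely on the same monitor for wider invariants would still need to be checked against the source.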

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
