[jira] [Commented] (HDFS-9137) DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded

Yi Liu (JIRA) Sat, 03 Oct 2015 18:00:07 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942520#comment-14942520
 ]


Yi Liu commented on HDFS-9137:
------------------------------

Thanks Uma for the patch. I agree we don't need synchronization of {{DataNode}} 
for Datanode#triggerBlockReport, so the current fix looks good overall. My 
comment is:

{code}
+      IOException exception = null;
       try {
         LOG.info("Reconfiguring " + property + " to " + newVal);
         this.refreshVolumes(newVal);
       } catch (IOException e) {
-        throw new ReconfigurationException(property, newVal,
-            getConf().get(property), e);
+        exception = e;
+      } finally {
+        // Send a full block report to let NN acknowledge the volume changes.
+        try {
+          triggerBlockReport(
+              new BlockReportOptions.Factory().setIncremental(false).build());
+        } catch (IOException e) {
+          LOG.warn("Exception while sending the block report after refresh"
+              + " volumes " + property + " to " + newVal, e);
+        } finally {
+          if (exception != null) {
+            throw new ReconfigurationException(property, newVal,
+                getConf().get(property), exception);
+          }
+        }
{code}

I think for the IOException of {{refreshVolumes}}, we just need to wrap it to 
ReconfigurationException then throw as original. For the exception of 
{{triggerBlockReport}}, we also need to wrap it and throw.  I see you just log 
warn for the IOException of {{triggerBlockReport}} in the patch and ignore it, 
any reason to do this?  If we want to do this, we don't need to save the 
IOException of refreshVolumns and throw it in the finally clause of 
triggerBlockReport, we just need to throw it directly, since the {{finally }} 
cause is executed before the throwing happens.

> DeadLock between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded 
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-9137
>                 URL: https://issues.apache.org/jira/browse/HDFS-9137
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.7.1
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-9137.00.patch
>
>
> I can see this code flows between DataNode#refreshVolumes and 
> BPOfferService#registrationSucceeded could cause deadLock.
> In practice situation may be rare as user calling refreshVolumes at the time 
> DN registration with NN. But seems like issue can happen.
>  Reason for deadLock:
>   1) refreshVolumes will be called with DN lock and after at the end it will 
> also trigger Block report. In the Block report call, 
> BPServiceActor#triggerBlockReport calls toString on bpos. Here it takes 
> readLock on bpos.
>  DN lock then boos lock
> 2) BPOfferSetrvice#registrationSucceeded call is taking writeLock on bpos and 
>  calling dn.bpRegistrationSucceeded which is again synchronized call on DN.
> bpos lock and then DN lock.
> So, this can clearly create dead lock.
> I think simple fix could be to move triggerBlockReport call outside out DN 
> lock and I feel that call may not be really needed inside DN lock.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HDFS-9137) DeadLock between DataNode#refreshVolumes and BPOfferService#registrationSucceeded

Reply via email to