[ 
https://issues.apache.org/jira/browse/HADOOP-8050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HADOOP-8050:
-------------------------------

    Attachment: hadoop-8050.patch.txt

If many methods are synchronized and the two classes containing them depend 
on each other, deadlock is likely.
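
To illustrate the pattern (this is not the actual metrics code), here is a 
minimal Java sketch. The class names SystemSide and SourceSide, their methods, 
and the latch that forces the unlucky interleaving are all hypothetical. Each 
thread takes one object's monitor through a synchronized method and then calls 
a synchronized method on the other object.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockSketch {
    // Hypothetical stand-ins for two classes that call into each other.
    static class SystemSide {
        SourceSide source;

        synchronized void snapshot(CountDownLatch gate) throws InterruptedException {
            meetAtGate(gate);   // SystemSide's monitor is held here
            source.read();      // now needs SourceSide's monitor
        }

        synchronized void lookup() { }
    }

    static class SourceSide {
        SystemSide system;

        synchronized void serve(CountDownLatch gate) throws InterruptedException {
            meetAtGate(gate);   // SourceSide's monitor is held here
            system.lookup();    // now needs SystemSide's monitor
        }

        synchronized void read() { }
    }

    // Wait until both threads hold their first monitor, forcing the bad interleaving.
    static void meetAtGate(CountDownLatch gate) throws InterruptedException {
        gate.countDown();
        gate.await();
    }

    // Returns true if the JVM reports the two threads as deadlocked on monitors.
    static boolean deadlocks() {
        SystemSide sys = new SystemSide();
        SourceSide src = new SourceSide();
        sys.source = src;
        src.system = sys;
        CountDownLatch gate = new CountDownLatch(2);

        Thread t1 = new Thread(() -> {
            try { sys.snapshot(gate); } catch (InterruptedException ignored) { }
        });
        Thread t2 = new Thread(() -> {
            try { src.serve(gate); } catch (InterruptedException ignored) { }
        });
        // Daemon threads, so the stuck threads do not keep the JVM alive.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + 5000;
        try {
            while (System.currentTimeMillis() < deadline) {
                long[] ids = bean.findMonitorDeadlockedThreads();
                if (ids != null && ids.length >= 2) {
                    return true;
                }
                Thread.sleep(20);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + deadlocks());
    }
}
```

In real code the interleaving depends on timing rather than on a latch, which 
is why such a deadlock may show up only occasionally.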

The current locking in metrics is a little excessive. I do not believe strict 
global consistency is required when processing metrics. For one, sources do 
not coordinate with each other (they are mostly independent), so locking the 
whole subsystem to take a snapshot adds little to the quality of the data.

This patch removes some locks around accessing the source adapter map within 
MetricsSystemImpl. This makes the metric snapshot only lock on each individual 
source adapter, one at a time, instead of the entire metrics impl.  This is 
safe because:

* Once sources are registered, they are not removed until shutdown(). 
shutdown() and stop() are themselves rarely called.

* During snapshot, the source adapter hashmap is the only data structure that 
needs protection.

* snapshot() is only called from the timer event handler. startTimer() makes 
sure that there is only one timer.

I wrapped the LinkedHashMap used for the source adapter map with 
Collections.synchronizedMap. This makes accessing the data structure safe 
without holding a big coarse lock. No further synchronization between sources 
seems to be needed.
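
As a rough illustration of the approach (not the patch itself), the sketch 
below wraps a LinkedHashMap with Collections.synchronizedMap and takes a 
snapshot while locking one adapter at a time. The names SnapshotSketch, 
Adapter, register() and snapshot() are hypothetical. One detail from the 
Collections.synchronizedMap Javadoc: iterating over the map's views still 
requires synchronizing on the map manually, so this sketch copies the entries 
under the map lock and releases it before touching any adapter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SnapshotSketch {
    // Hypothetical stand-in for a source adapter; it guards only its own state.
    static class Adapter {
        private long value;

        synchronized void update(long v) { value = v; }

        synchronized long read() { return value; }
    }

    // Individual get/put calls are thread-safe without any subsystem-wide lock.
    private final Map<String, Adapter> sources =
        Collections.synchronizedMap(new LinkedHashMap<String, Adapter>());

    void register(String name, Adapter adapter) {
        sources.put(name, adapter);
    }

    Map<String, Long> snapshot() {
        List<Map.Entry<String, Adapter>> copy;
        // Hold the map lock only long enough to copy the entries.
        synchronized (sources) {
            copy = new ArrayList<>(sources.entrySet());
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Adapter> entry : copy) {
            // Each read() locks just that one adapter, one at a time.
            result.put(entry.getKey(), entry.getValue().read());
        }
        return result;
    }

    public static void main(String[] args) {
        SnapshotSketch metrics = new SnapshotSketch();
        Adapter rpc = new Adapter();
        rpc.update(7);
        metrics.register("rpc", rpc);
        System.out.println(metrics.snapshot());
    }
}
```

Because no thread holds the map lock while waiting for an adapter lock, the 
snapshot path in this sketch never holds two locks at once.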

                
> Deadlock in metrics
> -------------------
>
>                 Key: HADOOP-8050
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8050
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 0.20.204.0, 0.20.205.0, 1.0.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>             Fix For: 1.1.0, 1.0.1
>
>         Attachments: hadoop-8050.patch.txt
>
>
> The metrics serving thread and the periodic snapshot thread can deadlock.
> It happened a few times on one of our namenodes. When it happens, RPC 
> works but the web UI and hftp stop working. I haven't looked at trunk too 
> closely, but it might happen there too.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
