[
https://issues.apache.org/jira/browse/HDFS-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485649#comment-15485649
]
Erik Krogen edited comment on HDFS-10475 at 9/12/16 11:47 PM:
--------------------------------------------------------------
To get a mapping of operation -> lock time metrics we propose the following:
1. Move the logging/metrics logic into FSNamesystemLock rather than
FSNamesystem to centralize logic and tracking.
2. Add new methods, {{(read|write)Unlock(operation)}}, which take a name for
the current operation at unlock time (for metrics collection the name is only
needed on unlock). If no operation is specified, a catch-all 'default' or
'other' operation would be used. We would manually add the operation name to
the unlock call for those operations which we think are likely to contribute
significantly to the overall lock hold time. This is a manual process because
the alternative is to capture a stack trace (to find the method name) on each
call to {{unlock}}, which may be prohibitively expensive.
3. Add a map of OperationName -> MutableRate metrics to FSNamesystemLock, all
of which are also contained within a MetricsRegistry. Each time a lock is
released we look up the corresponding MutableRate and add a value for the lock
hold time. We do not use the map within MetricsRegistry because it is
synchronized and we do not want contention on that map to cause slowness around
the FSNamesystem lock.
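As a rough illustration of item 2 (a sketch only, not the actual patch), the
unlock-with-operation idea could look like the following. The class name
{{TimedLockSketch}}, the {{HoldTimeSink}} interface and the {{"other"}} default
are all hypothetical; {{HoldTimeSink}} merely stands in for whatever adds a
sample to the per-operation MutableRate.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; this is not the actual FSNamesystemLock. */
class TimedLockSketch {
    /** Hypothetical stand-in for adding a sample to a per-operation MutableRate. */
    interface HoldTimeSink {
        void add(String operation, long holdNanos);
    }

    /** Catch-all name used when the caller does not specify an operation. */
    static final String OTHER = "other";

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final HoldTimeSink sink;
    // Only the thread holding the write lock touches this, so a plain field suffices.
    private long writeLockedAtNanos;
    // Many readers may hold the lock at once, so each thread tracks its own start time.
    private final ThreadLocal<Long> readLockedAtNanos = new ThreadLocal<Long>();

    TimedLockSketch(HoldTimeSink sink) {
        this.sink = sink;
    }

    void writeLock() {
        lock.writeLock().lock();
        if (lock.getWriteHoldCount() == 1) {   // outermost acquisition only
            writeLockedAtNanos = System.nanoTime();
        }
    }

    void writeUnlock() {
        writeUnlock(OTHER);
    }

    void writeUnlock(String operation) {
        boolean outermost = lock.getWriteHoldCount() == 1;
        long held = System.nanoTime() - writeLockedAtNanos;
        lock.writeLock().unlock();
        if (outermost) {
            sink.add(operation, held);         // recorded after the lock is released
        }
    }

    void readLock() {
        lock.readLock().lock();
        if (lock.getReadHoldCount() == 1) {    // outermost acquisition by this thread
            readLockedAtNanos.set(System.nanoTime());
        }
    }

    void readUnlock() {
        readUnlock(OTHER);
    }

    void readUnlock(String operation) {
        boolean outermost = lock.getReadHoldCount() == 1;
        long held = outermost ? System.nanoTime() - readLockedAtNanos.get() : 0L;
        lock.readLock().unlock();
        if (outermost) {
            readLockedAtNanos.remove();
            sink.add(operation, held);
        }
    }
}
```

In this sketch the sample is recorded only after the lock has been released,
so the bookkeeping does not itself lengthen the hold time, and reentrant
acquisitions are reported once, under the name given to the outermost unlock.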
Choosing the best type of map to hold the MutableRate metrics within
FSNamesystemLock is tricky. Ideally we would use a Java 8 ConcurrentHashMap,
using {{computeIfAbsent}} to create new MutableRate metrics objects and insert
them into the registry whenever a new operation is encountered. However, this
functionality is not available in Java 7 and we would like to support older
versions. Thus we propose using a regular HashMap (wrapped within a call to
{{Collections.unmodifiableMap}}) which is initialized with all of the different
operations at the time the FSNamesystemLock is created. This allows for
lock-free access, but requires that we have a list of all the possible
operations. So we suggest an Enum, e.g. FSNamesystemLockMetricOp, which lists
all of the operations of interest to be supplied to the {{(read|write)Unlock}}
calls. This would probably be a list of a few dozen operations that we expect
to be relatively expensive lock holders. Operations not listed within this Enum
would be regarded as "other"/"default".
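A minimal sketch of the pre-populated map described above, under stated
assumptions: the enum constants are hypothetical placeholders rather than a
proposed operation list, and {{RateSketch}} is a simplified stand-in for
Hadoop's MutableRate so that the example is self-contained.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical operation list; the real enum would name the costly operations. */
enum LockMetricOp { OTHER, GET_BLOCK_LOCATIONS, CREATE, DELETE }

/** Hypothetical stand-in for MutableRate: tracks sample count and total time. */
final class RateSketch {
    private final AtomicLong samples = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    void add(long nanos) {
        samples.incrementAndGet();
        totalNanos.addAndGet(nanos);
    }

    long samples() { return samples.get(); }

    long totalNanos() { return totalNanos.get(); }
}

final class LockMetricsSketch {
    // Fully populated in the constructor and never modified afterwards, so
    // lookups from many threads need no locking.
    private final Map<LockMetricOp, RateSketch> rates;

    LockMetricsSketch() {
        Map<LockMetricOp, RateSketch> m = new HashMap<LockMetricOp, RateSketch>();
        for (LockMetricOp op : LockMetricOp.values()) {
            m.put(op, new RateSketch());
        }
        rates = Collections.unmodifiableMap(m);
    }

    /** Adds one lock hold time sample; a null operation falls back to OTHER. */
    void record(LockMetricOp op, long holdNanos) {
        rates.get(op == null ? LockMetricOp.OTHER : op).add(holdNanos);
    }

    RateSketch rate(LockMetricOp op) {
        return rates.get(op);
    }
}
```

Because every enum constant gets an entry up front, {{rates.get}} never misses
and no entry is ever inserted on the unlock path; the cost is that a new
operation only shows up in the metrics once it has been added to the enum.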
We believe this is the right tradeoff between granularity of metrics,
performance, and developer effort, but it is certainly not ideal in terms of
manual effort required. We would be interested to hear any other ideas about
how to make the metrics collection require less manual intervention.
> Adding metrics for long FSNamesystem read and write locks
> ---------------------------------------------------------
>
> Key: HDFS-10475
> URL: https://issues.apache.org/jira/browse/HDFS-10475
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Xiaoyu Yao
> Assignee: Erik Krogen
>
> This is a follow-up to the comments on HADOOP-12916 and
> [here|https://issues.apache.org/jira/browse/HDFS-9924?focusedCommentId=15310837&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15310837]:
> add more metrics and WARN/DEBUG logs for long FSD/FSN locking operations on
> the namenode, similar to what we have for slow write/network WARN/metrics on
> the datanode.