[
https://issues.apache.org/jira/browse/HADOOP-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108438#comment-13108438
]
Luke Lu commented on HADOOP-7630:
---------------------------------
bq. I changed the blurb rate to 60 seconds around August 2008 on Y! clusters.
The blurb period (for metrics; the config blurb is on a separate period) was
actually still 5 seconds in metrics1 when we were deploying metrics2 (where we
use the default blurb period of 10 seconds) in 2010 on Y! clusters. Rajiv can
confirm this.
Are you saying the simon aggregator could not process even 1k udp packets per
second? In any case, the throughput I saw (a few months ago) on the simon
aggregator is far higher than that. Rajiv said that the limiting factor is not
udp packet processing at the aggregator but the iops to store the data.
bq. The Simon plugin is only doing add and average of samples.
I'm sure you meant the simon aggregator. It also does user-defined calculations
(defined in the simon config file), so if you lose the sole udp packet in the
reporting period, the derived metrics will not be correct; you need at least a
couple of samples in the reporting period. While MetricsTimeVaryingRate in
metrics1 and MutableRate in metrics2 do averaging and compute throughput, and
are used mostly in rpc-related metrics, most metrics in mapred are counters and
gauges, and almost all the mapred throughput metrics (*PerSec) are actually
derived metrics from the simon config. This approach halves the packet size vs
using the *Rate metrics in metrics sources. Simon sinks send one packet per
update, unlike ganglia, which sends one packet per metric per update.
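The point about derived metrics needing at least two counter samples per
reporting period can be sketched as follows. This is a hypothetical helper, not
Simon's actual aggregator code; the method name and signature are illustrative
only:

```java
// Minimal sketch of how an aggregator can derive a *PerSec metric from
// raw counter samples instead of shipping a *Rate metric in each packet.
public class DerivedRate {
    // Derive throughput (units/sec) from two counter samples taken at
    // timestamps t0 and t1 (milliseconds since epoch).
    public static double perSec(long count0, long t0Millis,
                                long count1, long t1Millis) {
        long elapsedMillis = t1Millis - t0Millis;
        if (elapsedMillis <= 0) {
            // With only one sample in the period there is no interval to
            // divide by, so the derived metric cannot be computed -- this
            // is why losing the sole udp packet breaks derived metrics.
            throw new IllegalArgumentException("need two distinct samples");
        }
        return (count1 - count0) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // 500 records counted across a 10-second update interval
        // -> 50 records/sec derived at the aggregator.
        System.out.println(perSec(1000, 0, 1500, 10_000)); // prints 50.0
    }
}
```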
bq. Are you concerning that the metrics might overflow if the publish rate is
at 60 seconds?
No. Even if some of them do, it's easy to see and explain on the graphs. Any
metrics backend using rrdtool should handle counter wraps automatically.
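For reference, rrdtool's COUNTER data-source type handles a wrap by assuming
the counter rolled over at its width when a sample decreases. A minimal sketch
of the 32-bit case (hypothetical helper, not rrdtool code):

```java
// Sketch of wrap-corrected deltas for a 32-bit monotonic counter, in the
// style of rrdtool's COUNTER data source.
public class CounterWrap {
    static final long WRAP32 = 1L << 32;

    // Delta between successive 32-bit counter samples, corrected for a
    // single wrap: a negative raw delta is assumed to be a rollover.
    public static long delta32(long prev, long curr) {
        long d = curr - prev;
        return d >= 0 ? d : d + WRAP32;
    }

    public static void main(String[] args) {
        System.out.println(delta32(100L, 150L));        // prints 50
        // Counter wrapped from near 2^32 back to 5: true delta is 11.
        System.out.println(delta32(4294967290L, 5L));   // prints 11
    }
}
```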
bq. As a side benefit, by reducing the period, it is less amount of cycle
spend in metrics monitoring, which makes the system more efficient.
At least with metrics2, which is more efficient than metrics1, even a 1-second
period had no noticeable impact on system performance last time I checked, as
the few hundred additional objects per second in the timer thread are mostly
noise compared with the overall gc and context-switching throughput on busy
servers.
My point is that you should not change the current default that has potential
impact on production monitoring without actually testing it at scale.
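For reference, a minimal sketch of the setting under discussion in
hadoop-metrics2.properties (the metrics2 built-in default is 10 seconds when
this is unset, per the discussion above):

```properties
# Default collection/publish period in seconds for all metrics prefixes.
# Sinks pick this up unless a more specific <prefix>.period overrides it.
*.period=60
```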
> hadoop-metrics2.properties should have a property *.period set to a default
> value for metrics
> ---------------------------------------------------------------------------------------------
>
> Key: HADOOP-7630
> URL: https://issues.apache.org/jira/browse/HADOOP-7630
> Project: Hadoop Common
> Issue Type: Bug
> Components: conf
> Reporter: Arpit Gupta
> Assignee: Eric Yang
> Fix For: 0.20.205.0, 0.23.0
>
> Attachments: HADOOP-7630-trunk.patch, HADOOP-7630.patch
>
>
> currently the hadoop-metrics2.properties file does not have a value set for
> *.period
> This property determines how often the metrics refresh. We should set it to
> a default of 60
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira