[ 
https://issues.apache.org/jira/browse/HADOOP-7630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108438#comment-13108438
 ] 

Luke Lu commented on HADOOP-7630:
---------------------------------

bq. I changed the publish rate to 60 seconds around August 2008 on Y! clusters.

The publish period for metrics (config refresh runs on a separate period) was 
actually still 5 seconds in metrics1 when we were deploying metrics2 (where we 
use the default publish period of 10 seconds) in 2010 on Y! clusters. Rajiv can 
confirm this. Are you saying the simon aggregator could not process fewer than 
1k udp packets per second? In any case, the throughput I saw on the simon 
aggregator a few months ago is far higher than that. Rajiv said that the 
limiting factor is not udp packet processing at the aggregator but the iops 
needed to store the data.
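For scale, a quick back-of-envelope sketch (the numbers and function name are 
illustrative, not measurements from any cluster): with one udp packet per node 
per publish period, even a few-thousand-node cluster stays under 1k packets per 
second at either default period.

```python
# Back-of-envelope aggregator load; numbers are illustrative, not measured.
def packets_per_second(nodes, sources_per_node, period_seconds):
    """One UDP packet per source per publish period (Simon-style sinks)."""
    return nodes * sources_per_node / period_seconds

print(packets_per_second(4000, 1, 5))   # 5 s metrics1 period  -> 800.0
print(packets_per_second(4000, 1, 10))  # 10 s metrics2 period -> 400.0
```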

bq.  The Simon plugin is only doing add and average of samples.

I'm sure you meant the simon aggregator. It also performs user-defined 
calculations (defined in the simon config file), so if you lose the sole udp 
packet in a reporting period, the derived metrics will be incorrect; you need 
at least a couple of samples per reporting period. While MetricsTimeVaryingRate 
in metrics1 and MutableRate in metrics2 do averaging and compute throughput, 
which are used mostly in rpc-related metrics, most metrics in mapred are 
counters and gauges, and almost all the mapred throughput metrics (*PerSec) are 
actually derived metrics from the simon config. This approach halves the packet 
size vs using the *Rate metrics in metrics sources. Simon sinks send one packet 
per update, unlike ganglia, which sends one packet per metric per update.
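To illustrate why at least two samples per reporting period matter, here is a 
sketch of deriving a *PerSec metric from a plain counter at the aggregator 
(hypothetical function and sample values, not simon's actual config syntax or 
API):

```python
# Sketch: derive a *PerSec metric from counter samples at the aggregator,
# instead of computing a *Rate metric in the source. Illustrative only.
def derive_per_sec(samples):
    """samples: list of (timestamp_seconds, counter_value) tuples.

    Needs at least two samples in the reporting period; if the single
    packet for a period is lost, the derived rate cannot be computed."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to derive a rate")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / (t1 - t0)

# e.g. a maps-completed counter sampled twice within one reporting period:
print(derive_per_sec([(0, 1000), (10, 1600)]))  # -> 60.0 per second
```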

bq. Are you concerned that the metrics might overflow if the publish rate is 
at 60 seconds?

No. Even if some of them do, it's easy to see and explain on the graphs. Any 
metrics backend built on rrdtool should handle counter wraps automatically.
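For reference, rrdtool-style wrap handling amounts to something like this 
sketch (assuming an unsigned 32-bit counter; rrdtool's COUNTER data source 
applies a similar 32/64-bit wrap heuristic):

```python
# Sketch of rrdtool-style COUNTER wrap correction for an unsigned counter.
def counter_delta(prev, cur, bits=32):
    """Return the increase between two counter readings, correcting a wrap."""
    if cur >= prev:
        return cur - prev
    # Counter wrapped past 2**bits - 1; add the missing span back in.
    return (1 << bits) - prev + cur

print(counter_delta(100, 150))           # no wrap -> 50
print(counter_delta(4294967290, 5))      # wrapped near 2**32 -> 11
```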

bq. As a side benefit, by reducing the rate, fewer cycles are spent on metrics 
monitoring, which makes the system more efficient.

At least with metrics2, which is more efficient than metrics1, even a 1-second 
period had no noticeable impact on system performance the last time I checked: 
the extra few hundred objects per second in the timer thread are mostly noise 
compared with overall gc and context-switching throughput on busy servers.

My point is that you should not change the current default, which has 
potential impact on production monitoring, without actually testing the change 
at scale.


> hadoop-metrics2.properties should have a property *.period set to a default 
> value for metrics
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-7630
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7630
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: conf
>            Reporter: Arpit Gupta
>            Assignee: Eric Yang
>             Fix For: 0.20.205.0, 0.23.0
>
>         Attachments: HADOOP-7630-trunk.patch, HADOOP-7630.patch
>
>
> Currently the hadoop-metrics2.properties file does not have a value set for 
> *.period
> This property determines how often metrics are refreshed. We should set it 
> to a default of 60

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
