[ 
https://issues.apache.org/jira/browse/HDFS-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254503#comment-15254503
 ] 

Colin Patrick McCabe commented on HDFS-10175:
---------------------------------------------

bq. Can I also note that as the @Public @Stable FileSystem is widely 
subclassed, with its protected statistics field accessed in those subclasses, 
nobody is allowed to take it or its current methods away. Thanks.

Yeah, I agree.  I would like to see us get more cautious about adding new 
things to {{FileSystem#Statistics}}, though, since I think it's not a good 
match for most of the new stats we're proposing here.

bq. There's no per-thread tracking, — it's collecting overall stats, rather than 
trying to add up the cost of a single execution, which is what per-thread stuff 
would presumably do. This is lower cost but still permits microbenchmark-style 
analysis of performance problems against S3a. It doesn't directly let you get 
results of a job, "34MB of data, 2000 stream aborts, 1998 backward seeks" which 
are the kind of things I'm curious about.

Overall stats are lower cost in terms of memory consumption and the cost to 
read (as opposed to update) a metric.  They are higher cost in terms of the CPU 
consumed for each update of the metric.  In particular, for applications that 
do a lot of stream operations from many different threads, updating an 
AtomicLong can become a performance bottleneck.
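To illustrate the trade-off, here is a minimal sketch (not Hadoop code; all names are illustrative) contrasting a single shared AtomicLong with per-thread counters. The shared counter is cheap to read but every update from every thread hits the same cache line; the per-thread counters make updates uncontended, at the price of a read that must sum over all threads.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: two ways to count bytes read across threads.
public class StatsSketch {
    // Shared counter: updates contend on one cache line, reads are O(1).
    static final AtomicLong sharedBytesRead = new AtomicLong();

    // Per-thread counters: updates touch only the current thread's
    // counter, reads must sum over all threads.
    static final ConcurrentHashMap<Thread, AtomicLong> perThreadBytesRead =
            new ConcurrentHashMap<>();

    static void recordPerThread(long bytes) {
        perThreadBytesRead
            .computeIfAbsent(Thread.currentThread(), t -> new AtomicLong())
            .addAndGet(bytes);
    }

    static long readPerThread() {
        long sum = 0;
        for (AtomicLong c : perThreadBytesRead.values()) {
            sum += c.get();
        }
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    sharedBytesRead.addAndGet(1); // contended update
                    recordPerThread(1);           // uncontended update
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(sharedBytesRead.get()); // 4000
        System.out.println(readPerThread());       // 4000
    }
}
```

Both schemes arrive at the same totals; the difference is where the cost lands (update path vs. read path).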

One of the points that I was making above is that I think it's appropriate for 
some metrics to be tracked per-thread, while for others we probably want to use 
an AtomicLong or similar.  I would expect that anything that leads to an s3 RPC 
could easily be tracked by an AtomicLong, since the overhead of the network 
activity would dwarf the AtomicLong update overhead.  And we should have a 
common interface for getting this information that MR and other stats consumers 
can use.
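A common consumer-facing interface could be as small as a snapshot of named 64-bit values. The following is a hypothetical sketch only (these names are not the Hadoop API); it shows how MR or another consumer could read statistics without knowing the FS implementation:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a read-only view that any FS implementation
// could expose and any stats consumer could poll.
interface StatisticsSource {
    Map<String, Long> snapshot(); // statistic name -> 64-bit value
}

public class StatsInterfaceSketch {
    public static void main(String[] args) {
        // An S3A-like source exposing the kinds of counters discussed
        // above; the values here are made up for illustration.
        StatisticsSource s3aStats = () -> {
            Map<String, Long> m = new TreeMap<>();
            m.put("streamAborts", 2000L);
            m.put("backwardSeeks", 1998L);
            return m;
        };

        // A consumer aggregates over the snapshot generically.
        long total = 0;
        for (long v : s3aStats.snapshot().values()) {
            total += v;
        }
        System.out.println(total); // 3998
    }
}
```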

bq. Maybe, and this would be nice, whatever is implemented here is (a) 
extensible to support some duration type too, at least in parallel, 

The interface here supports storing durations as 64-bit numbers of 
milliseconds, which seems reasonable.  It is up to the implementor of the 
statistic to determine what the 64-bit long represents (average duration in ms, 
median duration in ms, number of RPCs, etc.)  This is similar to metrics2, JMX, 
and so on, where you have basic types that can be used in a few different ways.

bq. and (b) could be used as a back end by both Metrics2 and Coda Hale metrics 
registries. That way the slightly more expensive metric systems would have 
access to this more raw data.

Sure.  The difficult question is how metrics2 hooks up to metrics that are 
per-FS or per-stream.  Should the output of metrics2 reflect the union of all 
existing FS and stream instances?  Some applications open a very large number 
of streams, so it seems impractical for metrics2 to include all of those 
streams in its output.

> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>
>                 Key: HDFS-10175
>                 URL: https://issues.apache.org/jira/browse/HDFS-10175
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Ram Venkatesh
>            Assignee: Mingliang Liu
>         Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch, 
> HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, 
> HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java
>
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in-turn exposed as job counters by MapReduce and other frameworks. 
> There is logic within DfsClient to map operations to these counters that can 
> be confusing, for instance, mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation including create, append, 
> createSymlink, delete, exists, mkdirs, rename and expose them as new 
> properties on the Statistics object. The operation-specific counters can be 
> used for analyzing the load imposed by a particular job on HDFS. 
> For example, we can use them to identify jobs that end up creating a large 
> number of files.
> Once this information is available in the Statistics object, the app 
> frameworks like MapReduce can expose them as additional counters to be 
> aggregated and recorded as part of job summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
