[
https://issues.apache.org/jira/browse/HDFS-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254503#comment-15254503
]
Colin Patrick McCabe commented on HDFS-10175:
---------------------------------------------
bq. Can I also note that as the @Public @Stable FileSystem is widely
subclassed, with its protected statistics field accessed in those subclasses,
nobody is allowed to take it or its current methods away. Thanks.
Yeah, I agree. I would like to see us get more cautious about adding new
things to {{FileSystem#Statistics}}, though, since I think it's not a good
match for most of the new stats we're proposing here.
bq. There's no per-thread tracking; it's collecting overall stats, rather than
trying to add up the cost of a single execution, which is what per-thread stuff
would presumably do. This is lower cost but still permits microbenchmark-style
analysis of performance problems against S3a. It doesn't directly let you get
results of a job, "34MB of data, 2000 stream aborts, 1998 backward seeks", which
are the kind of things I'm curious about.
Overall stats are lower cost in terms of memory consumption and the cost of
reading (as opposed to updating) a metric. They are higher cost in terms of the
CPU consumed for each update of the metric. In particular, for applications
that do a lot of stream operations from many different threads, updating an
AtomicLong can become a performance bottleneck.
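To illustrate the trade-off being described, here is a minimal sketch (not from any patch here) contrasting a single shared {{AtomicLong}}, where every increment contends on one variable, with Java's {{LongAdder}}, which stripes updates across cells and only sums them on read:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class CounterContention {
    // Shared counter: every increment is a CAS on the same variable,
    // so many threads updating it contend with each other.
    static final AtomicLong shared = new AtomicLong();
    // Striped counter: increments land on per-cell slots and are
    // summed only when read, so updates rarely contend.
    static final LongAdder striped = new LongAdder();

    public static void main(String[] args) throws InterruptedException {
        final int threads = 4, opsPerThread = 100_000;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < opsPerThread; j++) {
                    shared.incrementAndGet(); // contended update
                    striped.increment();      // mostly uncontended update
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        // Both counters end up with the same total; they differ in the
        // CPU cost of getting there under heavy multi-threaded update.
        System.out.println(shared.get());   // 400000
        System.out.println(striped.sum());  // 400000
    }
}
```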
One of the points that I was making above is that I think it's appropriate for
some metrics to be tracked per-thread, but for others, we probably want to use
AtomicLong or similar. I would expect that anything that leads to an S3 RPC
could easily be tracked by an AtomicLong, since the overhead of the network
activity would dwarf the overhead of the AtomicLong update.
And we should have a common interface for getting this information that MR and
stats consumers can use.
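A common interface of the sort being discussed might look roughly like the following; the names ({{PerOperationStatistics}}, {{snapshot}}, the counter keys) are purely hypothetical and not taken from any of the attached patches:

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical consumer-facing interface for per-operation statistics.
interface PerOperationStatistics {
    long getCounter(String name);   // value of one named statistic, 0 if untracked
    Map<String, Long> snapshot();   // point-in-time view of all tracked statistics
}

// Trivial map-backed implementation to make the sketch concrete.
class MapBackedStatistics implements PerOperationStatistics {
    private final ConcurrentHashMap<String, AtomicLong> counters =
            new ConcurrentHashMap<>();

    void increment(String name) {
        counters.computeIfAbsent(name, k -> new AtomicLong()).incrementAndGet();
    }

    public long getCounter(String name) {
        AtomicLong c = counters.get(name);
        return c == null ? 0L : c.get();
    }

    public Map<String, Long> snapshot() {
        Map<String, Long> out = new java.util.HashMap<>();
        counters.forEach((k, v) -> out.put(k, v.get()));
        return Collections.unmodifiableMap(out);
    }
}

public class StatsInterfaceDemo {
    public static void main(String[] args) {
        MapBackedStatistics stats = new MapBackedStatistics();
        stats.increment("mkdirs");
        stats.increment("mkdirs");
        stats.increment("rename");
        System.out.println(stats.getCounter("mkdirs")); // 2
        System.out.println(stats.getCounter("delete")); // 0 (never tracked)
    }
}
```

A consumer like MR would only depend on the read-side interface, leaving each FS free to choose per-thread or AtomicLong tracking internally.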
bq. Maybe, and this would be nice, whatever is implemented here is (a)
extensible to support some duration type too, at least in parallel,
The interface here supports storing durations as 64-bit numbers of
milliseconds, which seems good. It is up to the implementor of the statistic
to determine what the 64-bit long represents (average duration in ms, median
duration in ms, number of RPCs, etc.). This is similar to metrics2 and
JMX, where you have basic types that can be used in a few different ways.
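A small sketch of that convention: two statistics are both exposed as plain 64-bit longs, and only the statistic's name and documented contract tell the consumer how to interpret each one. The sample durations below are made-up values for illustration:

```java
// Sketch: the same primitive type (long) carries different semantics
// depending on the statistic, as with metrics2/JMX basic types.
public class DurationAsLong {
    public static void main(String[] args) {
        long[] rpcDurationsMs = {12, 7, 25, 4};   // hypothetical samples

        long rpcCount = rpcDurationsMs.length;     // statistic: "number of RPCs"
        long totalMs = 0;
        for (long d : rpcDurationsMs) {
            totalMs += d;
        }
        long meanDurationMs = totalMs / rpcCount;  // statistic: "average duration in ms"

        // Both values are just longs; the interpretation lives in the
        // statistic's name and contract, not in the type system.
        System.out.println(rpcCount);        // 4
        System.out.println(meanDurationMs);  // 12
    }
}
```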
bq. and (b) could be used as a back end by both Metrics2 and Coda Hale metrics
registries. That way the slightly more expensive metric systems would have
access to this more raw data.
Sure. The difficult question is how metrics2 hooks up to metrics that are
per-FS or per-stream. Should the output of metrics2 reflect the union of all
existing FS and stream instances? Some applications open a very large number
of streams, so it seems impractical for metrics2 to include all of those
streams in its output.
> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>
> Key: HDFS-10175
> URL: https://issues.apache.org/jira/browse/HDFS-10175
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Reporter: Ram Venkatesh
> Assignee: Mingliang Liu
> Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch,
> HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch,
> HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java
>
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in turn exposed as job counters by MapReduce and other frameworks.
> The logic within DfsClient that maps operations to these counters can be
> confusing; for instance, mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation including create, append,
> createSymlink, delete, exists, mkdirs, rename and expose them as new
> properties on the Statistics object. The operation-specific counters can be
> used for analyzing the load imposed by a particular job on HDFS.
> For example, we can use them to identify jobs that end up creating a large
> number of files.
> Once this information is available in the Statistics object, the app
> frameworks like MapReduce can expose them as additional counters to be
> aggregated and recorded as part of job summary.
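The per-operation counters proposed in the description could be sketched along the following lines; the class name, operation list, and method names are illustrative only, not the API of the attached patches:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical sketch: one counter slot per DfsClient operation,
// indexed by an enum so reads and updates are a simple array access.
public class OpStatistics {
    enum Op { CREATE, APPEND, CREATE_SYMLINK, DELETE, EXISTS, MKDIRS, RENAME }

    private final AtomicLongArray counts =
            new AtomicLongArray(Op.values().length);

    void increment(Op op) { counts.incrementAndGet(op.ordinal()); }
    long get(Op op)       { return counts.get(op.ordinal()); }

    public static void main(String[] args) {
        OpStatistics stats = new OpStatistics();
        // e.g. a job that creates directories and renames its output
        stats.increment(Op.MKDIRS);
        stats.increment(Op.MKDIRS);
        stats.increment(Op.RENAME);
        System.out.println(stats.get(Op.MKDIRS)); // 2
        System.out.println(stats.get(Op.RENAME)); // 1
    }
}
```

Counters of this shape are what a framework like MapReduce could aggregate into per-job totals, e.g. to spot jobs that create very large numbers of files.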
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)