[
https://issues.apache.org/jira/browse/HDFS-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258012#comment-15258012
]
Steve Loughran commented on HDFS-10175:
---------------------------------------
One piece of background here is what I'm currently exploring in terms of
[Metrics-first
testing|http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html],
that is, instrumenting code and using its observed state in both unit and
system tests. I've done this in Slider (SLIDER-82) and Spark (SPARK-7889) and
found it highly effective for writing deterministic tests which also provide
information parseable by test runners for better analysis of test runs.
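As a toy illustration of the idea (the class, counter, and test names here are invented for this sketch, not the Slider or Spark code):
{code}
// Sketch of metrics-first testing: instrument the code with a counter,
// then assert on that observed state rather than on timing or log output.
import static org.junit.Assert.assertEquals;

import java.util.concurrent.atomic.AtomicLong;
import org.junit.Test;

public class TestRetryMetrics {

  /** Hypothetical component instrumented with a retry counter. */
  static class RetryingClient {
    private final AtomicLong retries = new AtomicLong();

    void invoke(boolean failFirstAttempt) {
      if (failFirstAttempt) {
        retries.incrementAndGet();   // record the retry as it happens
      }
      // ... perform the operation ...
    }

    long getRetryCount() {
      return retries.get();
    }
  }

  @Test
  public void testRetryIsCounted() {
    RetryingClient client = new RetryingClient();
    client.invoke(true);
    // The assertion is on internal, observed state, which keeps the
    // test deterministic and its failure message meaningful.
    assertEquals(1, client.getRetryCount());
  }
}
{code}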
bq. I understand your eagerness to get the s3 stats in, but I would rather not
proliferate more statistics interfaces if possible. Once they're in, we really
can't get rid of them, and it becomes very confusing and clunky.
I see that, but stream-level counters are essential, at least for the tests
which verify forward and lazy seeks, which means that yes, they do have to go
into the 2.8.0 release. What I can do is scope them as package private, then
have the test code in that package implement the assertions about
metric-derived state, roughly as sketched below.
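A rough sketch of that arrangement (package, class, and counter names are purely illustrative, not the actual S3A ones):
{code}
// Illustrative only: the stream keeps its counters package private,
// so only tests in the same package can assert on them.
package org.example.fs.s3x;                      // hypothetical package

class InstrumentedInputStream {
  // package-private counters, visible to same-package test code
  long forwardSeekOperations;
  long bytesSkippedOnSeek;
  private long pos;

  void seek(long target) {
    if (target > pos) {
      forwardSeekOperations++;
      bytesSkippedOnSeek += target - pos;        // forward/lazy seek path
    }
    pos = target;
  }
}
{code}
and, in the test source tree under the same package:
{code}
package org.example.fs.s3x;                      // same package as the stream

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TestSeekCounters {
  @Test
  public void testForwardSeekIsCounted() {
    InstrumentedInputStream in = new InstrumentedInputStream();
    in.seek(1024);
    // verify the forward-seek path was taken, via the package-private counters
    assertEquals(1, in.forwardSeekOperations);
    assertEquals(1024, in.bytesSkippedOnSeek);
  }
}
{code}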
Regarding the metrics2 instrumentation in HADOOP-13028, I'm aggregating the
stream statistics back into the metrics2 data. That's something which isn't
needed for the Hadoop tests, but which I'm logging in Spark test runs, such as
(formatted for readability; a sketch of the aggregation itself follows the log):
{code}
2016-04-26 12:08:25,901 executor.Executor Running task 0.0 in stage 0.0 (TID 0)
2016-04-26 12:08:25,924 rdd.HadoopRDD Input split: s3a://landsat-pds/scene_list.gz:0+20430493
2016-04-26 12:08:26,107 compress.CodecPool - Got brand-new decompressor [.gz]
2016-04-26 12:08:32,304 executor.Executor Finished task 0.0 in stage 0.0 (TID 0). 2643 bytes result sent to driver
2016-04-26 12:08:32,311 scheduler.TaskSetManager Finished task 0.0 in stage 0.0 (TID 0) in 6434 ms on localhost (1/1)
2016-04-26 12:08:32,312 scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks have all completed, from pool
2016-04-26 12:08:32,315 scheduler.DAGScheduler ResultStage 0 finished in 6.447 s
2016-04-26 12:08:32,319 scheduler.DAGScheduler Job 0 finished took 6.560166 s
2016-04-26 12:08:32,320 s3.S3aIOSuite size of s3a://landsat-pds/scene_list.gz = 464105 rows read in 6779125000 nS
2016-04-26 12:08:32,324 s3.S3aIOSuite Filesystem statistics
S3AFileSystem{uri=s3a://landsat-pds,
workingDir=s3a://landsat-pds/user/stevel,
partSize=104857600, enableMultiObjectsDelete=true,
multiPartThreshold=2147483647,
statistics {
20430493 bytes read,
0 bytes written,
3 read ops,
0 large read ops,
0 write ops},
metrics {{Context=S3AFileSystem}
{FileSystemId=29890500-aed6-4eb8-bb47-0c896a66aac2-landsat-pds}
{fsURI=s3a://landsat-pds/scene_list.gz}
{streamOpened=1}
{streamCloseOperations=1}
{streamClosed=1}
{streamAborted=0}
{streamSeekOperations=0}
{streamReadExceptions=0}
{streamForwardSeekOperations=0}
{streamBackwardSeekOperations=0}
{streamBytesSkippedOnSeek=0}
{streamBytesRead=20430493}
{streamReadOperations=1488}
{streamReadFullyOperations=0}
{streamReadOperationsIncomplete=1488}
{files_created=0}
{files_copied=0}
{files_copied_bytes=0}
{files_deleted=0}
{directories_created=0}
{directories_deleted=0}
{ignored_errors=0}
}}
{code}
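For reference, the aggregation mentioned above amounts to something like the following sketch; {{MetricsRegistry}} and {{MutableCounterLong}} are the stock Hadoop metrics2 classes, while the enclosing class and merge method are illustrative rather than the actual S3A instrumentation:
{code}
// Sketch: fold per-stream counters back into filesystem-level metrics2
// counters when a stream is closed. Names are illustrative only.
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

class StreamInstrumentationSketch {
  private final MetricsRegistry registry = new MetricsRegistry("S3AFileSystem");
  private final MutableCounterLong streamBytesRead =
      registry.newCounter("streamBytesRead", "bytes read from streams", 0L);
  private final MutableCounterLong streamForwardSeeks =
      registry.newCounter("streamForwardSeekOperations", "forward seeks", 0L);

  /**
   * Called from the stream's close(): merge that stream's counters into
   * the aggregate metrics2 counters.
   */
  void mergeStreamStatistics(long bytesRead, long forwardSeeks) {
    streamBytesRead.incr(bytesRead);
    streamForwardSeeks.incr(forwardSeeks);
  }
}
{code}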
The Spark code isn't accessing these metrics, though it could if it tried hard
enough (i.e. went to the metrics registry).
It's publishing those stream-level operations which I think you are most
concerned about; the other metrics are roughly a subset of those already in
Azure's metrics2 instrumentation. Accordingly, I will modify the S3A
instrumentation to *not* register the stream operations as metrics2 counters,
retaining them internally and in the toString value (sketched below).
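Concretely, that means the stream counters end up as plain fields, surfaced only through toString(), with nothing registered against metrics2; a rough sketch with illustrative names:
{code}
// Sketch: counters kept internal to the stream and exposed only via
// toString(), i.e. not registered as metrics2 counters.
class StreamStatisticsSketch {
  long streamOpened;
  long streamClosed;
  long streamBytesRead;
  long streamForwardSeekOperations;

  @Override
  public String toString() {
    return String.format(
        "StreamStatistics{opened=%d, closed=%d, bytesRead=%d, forwardSeeks=%d}",
        streamOpened, streamClosed, streamBytesRead, streamForwardSeekOperations);
  }
}
{code}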
I hope that's enough to satisfy your concerns while still retaining the
information I need for s3a functionality and testing.
> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>
> Key: HDFS-10175
> URL: https://issues.apache.org/jira/browse/HDFS-10175
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Reporter: Ram Venkatesh
> Assignee: Mingliang Liu
> Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch,
> HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch,
> HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java
>
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in turn exposed as job counters by MapReduce and other frameworks.
> The logic within DfsClient that maps operations to these counters can be
> confusing; for instance, mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation including create, append,
> createSymlink, delete, exists, mkdirs, rename and expose them as new
> properties on the Statistics object. The operation-specific counters can be
> used for analyzing the load imposed by a particular job on HDFS.
> For example, we can use them to identify jobs that end up creating a large
> number of files.
> Once this information is available in the Statistics object, the app
> frameworks like MapReduce can expose them as additional counters to be
> aggregated and recorded as part of job summary.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)