[
https://issues.apache.org/jira/browse/HDFS-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258012#comment-15258012
]
Steve Loughran commented on HDFS-10175:
---------------------------------------
One piece of background here is what I'm currently exploring in terms of
[Metrics-first
testing|http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html],
that is, instrumenting code and using its observed state in both unit and
system tests. I've done this in Slider (SLIDER-82) and Spark (SPARK-7889) and
found it highly effective for writing deterministic tests which also provide
information parseable by test runners for better analysis of test runs.
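As a toy illustration of the idea (the class, counter, and test names here are invented for this sketch, not the Slider or Spark code):
{code}
// Sketch of metrics-first testing: instrument the code with a counter,
// then assert on that observed state rather than on timing or log output.
import static org.junit.Assert.assertEquals;

import java.util.concurrent.atomic.AtomicLong;
import org.junit.Test;

public class TestRetryMetrics {

  /** Hypothetical component instrumented with a retry counter. */
  static class RetryingClient {
    private final AtomicLong retries = new AtomicLong();

    void invoke(boolean failFirstAttempt) {
      if (failFirstAttempt) {
        retries.incrementAndGet();   // record the retry as it happens
      }
      // ... perform the operation ...
    }

    long getRetryCount() {
      return retries.get();
    }
  }

  @Test
  public void testRetryIsCounted() {
    RetryingClient client = new RetryingClient();
    client.invoke(true);
    // The assertion is on internal, observed state, which keeps the
    // test deterministic and its failure message meaningful.
    assertEquals(1, client.getRetryCount());
  }
}
{code}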
bq. I understand your eagerness to get the s3 stats in, but I would rather not
proliferate more statistics interfaces if possible. Once they're in, we really
can't get rid of them, and it becomes very confusing and clunky.
I see that, but stream-level counters are essential, at least for the tests
which verify forward and lazy seeks, which means that yes, they do have to go
into the 2.8.0 release. What I can do is scope them as package private, then
have the test code in that package implement the assertions about
metric-derived state, roughly as sketched below.
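A rough sketch of that arrangement (package, class, and counter names are purely illustrative, not the actual S3A ones):
{code}
// Illustrative only: the stream keeps its counters package private,
// so only tests in the same package can assert on them.
package org.example.fs.s3x;                      // hypothetical package

class InstrumentedInputStream {
  // package-private counters, visible to same-package test code
  long forwardSeekOperations;
  long bytesSkippedOnSeek;
  private long pos;

  void seek(long target) {
    if (target > pos) {
      forwardSeekOperations++;
      bytesSkippedOnSeek += target - pos;        // forward/lazy seek path
    }
    pos = target;
  }
}
{code}
and, in the test source tree under the same package:
{code}
package org.example.fs.s3x;                      // same package as the stream

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TestSeekCounters {
  @Test
  public void testForwardSeekIsCounted() {
    InstrumentedInputStream in = new InstrumentedInputStream();
    in.seek(1024);
    // verify the forward-seek path was taken, via the package-private counters
    assertEquals(1, in.forwardSeekOperations);
    assertEquals(1024, in.bytesSkippedOnSeek);
  }
}
{code}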
Regarding the metrics2 instrumentation in HADOOP-13028, I'm aggregating the
stream statistics back into the metrics2 data. That's something which isn't
needed for the Hadoop tests, but which I'm logging in Spark test runs, such as
(formatted for readability; a sketch of the aggregation itself follows the log):
{code}
2016-04-26 12:08:25,901 executor.Executor Running task 0.0 in stage 0.0 (TID 0)
2016-04-26 12:08:25,924 rdd.HadoopRDD Input split: s3a://landsat-pds/scene_list.gz:0+20430493
2016-04-26 12:08:26,107 compress.CodecPool - Got brand-new decompressor [.gz]
2016-04-26 12:08:32,304 executor.Executor Finished task 0.0 in stage 0.0 (TID 0). 2643 bytes result sent to driver
2016-04-26 12:08:32,311 scheduler.TaskSetManager Finished task 0.0 in stage 0.0 (TID 0) in 6434 ms on localhost (1/1)
2016-04-26 12:08:32,312 scheduler.TaskSchedulerImpl Removed TaskSet 0.0, whose tasks have all completed, from pool
2016-04-26 12:08:32,315 scheduler.DAGScheduler ResultStage 0 finished in 6.447 s
2016-04-26 12:08:32,319 scheduler.DAGScheduler Job 0 finished took 6.560166 s
2016-04-26 12:08:32,320 s3.S3aIOSuite size of s3a://landsat-pds/scene_list.gz = 464105 rows read in 6779125000 nS
2016-04-26 12:08:32,324 s3.S3aIOSuite Filesystem statistics
S3AFileSystem{uri=s3a://landsat-pds,
workingDir=s3a://landsat-pds/user/stevel,
partSize=104857600, enableMultiObjectsDelete=true,
multiPartThreshold=2147483647,
statistics {
20430493 bytes read,
0 bytes written,
3 read ops,
0 large read ops,
0 write ops},
metrics {{Context=S3AFileSystem}
{FileSystemId=29890500-aed6-4eb8-bb47-0c896a66aac2-landsat-pds}
{fsURI=s3a://landsat-pds/scene_list.gz}
{streamOpened=1}
{streamCloseOperations=1}
{streamClosed=1}
{streamAborted=0}
{streamSeekOperations=0}
{streamReadExceptions=0}
{streamForwardSeekOperations=0}
{streamBackwardSeekOperations=0}
{streamBytesSkippedOnSeek=0}
{streamBytesRead=20430493}
{streamReadOperations=1488}
{streamReadFullyOperations=0}
{streamReadOperationsIncomplete=1488}
{files_created=0}
{files_copied=0}
{files_copied_bytes=0}
{files_deleted=0}
{directories_created=0}
{directories_deleted=0}
{ignored_errors=0}
}}
{code}
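For reference, the aggregation mentioned above amounts to something like the following sketch; {{MetricsRegistry}} and {{MutableCounterLong}} are the stock Hadoop metrics2 classes, while the enclosing class and merge method are illustrative rather than the actual S3A instrumentation:
{code}
// Sketch: fold per-stream counters back into filesystem-level metrics2
// counters when a stream is closed. Names are illustrative only.
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

class StreamInstrumentationSketch {
  private final MetricsRegistry registry = new MetricsRegistry("S3AFileSystem");
  private final MutableCounterLong streamBytesRead =
      registry.newCounter("streamBytesRead", "bytes read from streams", 0L);
  private final MutableCounterLong streamForwardSeeks =
      registry.newCounter("streamForwardSeekOperations", "forward seeks", 0L);

  /**
   * Called from the stream's close(): merge that stream's counters into
   * the aggregate metrics2 counters.
   */
  void mergeStreamStatistics(long bytesRead, long forwardSeeks) {
    streamBytesRead.incr(bytesRead);
    streamForwardSeeks.incr(forwardSeeks);
  }
}
{code}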
The Spark code isn't accessing these metrics, though it could if it tried hard
enough (i.e. went to the metrics registry).
It's publishing those stream-level operations which I think you are most
concerned about; the other metrics are roughly a subset of those already in
Azure's metrics2 instrumentation. Accordingly, I will modify the S3A
instrumentation to *not* register the stream operations as metrics2 counters,
retaining them internally and in the toString value (sketched below).
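Concretely, that means the stream counters end up as plain fields, surfaced only through toString(), with nothing registered against metrics2; a rough sketch with illustrative names:
{code}
// Sketch: counters kept internal to the stream and exposed only via
// toString(), i.e. not registered as metrics2 counters.
class StreamStatisticsSketch {
  long streamOpened;
  long streamClosed;
  long streamBytesRead;
  long streamForwardSeekOperations;

  @Override
  public String toString() {
    return String.format(
        "StreamStatistics{opened=%d, closed=%d, bytesRead=%d, forwardSeeks=%d}",
        streamOpened, streamClosed, streamBytesRead, streamForwardSeekOperations);
  }
}
{code}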
I hope that's enough to satisfy your concerns while still retaining the
information I need for s3a functionality and testing.
> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>
> Key: HDFS-10175
> URL: https://issues.apache.org/jira/browse/HDFS-10175
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Reporter: Ram Venkatesh
> Assignee: Mingliang Liu
> Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch,
> HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch,
> HDFS-10175.005.patch, HDFS-10175.006.patch, TestStatisticsOverhead.java
>
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in turn exposed as job counters by MapReduce and other frameworks.
> The logic within DfsClient that maps operations to these counters can be
> confusing; for instance, mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation including create, append,
> createSymlink, delete, exists, mkdirs, rename and expose them as new
> properties on the Statistics object. The operation-specific counters can be
> used for analyzing the load imposed by a particular job on HDFS.
> For example, we can use them to identify jobs that end up creating a large
> number of files.
> Once this information is available in the Statistics object, the app
> frameworks like MapReduce can expose them as additional counters to be
> aggregated and recorded as part of job summary.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)