[ 
https://issues.apache.org/jira/browse/HDFS-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15236382#comment-15236382
 ] 

Mingliang Liu commented on HDFS-10175:
--------------------------------------

Thank you very much [~mingma] for the comments.

Yes, the NN also has an audit log that tracks DFS operations, but the missing 
DFSClient name makes it less useful for the user. Your point about the 
motivation and the map vs. inline implementation of [HDFS-9579] makes a lot of 
sense to me; I see no performance benefit from the map approach. I was still 
wondering whether using a composite data structure (e.g. an enum map or array) 
to manage the distance->bytesRead mapping would make the code simpler (see the 
sketch after the list below).
0) {{StatisticsData}} would be a bit shorter by delegating the operations to 
the composite data structure.
1) {{incrementBytesReadByDistance(int distance, long newBytes)}} and 
{{getBytesReadByDistance(int distance)}}, which currently switch-case over 
hard-coded variables, could be simplified since we can set/get the bytesRead 
by distance directly in the map/array.
2) Move {{long getBytesReadByDistance(int distance)}} from {{Statistics}} to 
{{StatisticsData}}. If the user wants the bytes read for all distances, she 
can call getData() once and then iterate over the map/array. For the case of 
1K client threads, this may save the effort of aggregating per distance.
[~cmccabe] may have different comments about this?
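
To make points 0) - 2) concrete, here is a minimal sketch of what an 
array-backed version could look like. This is not the actual patch; the class 
nesting mirrors {{FileSystem.Statistics.StatisticsData}}, but the MAX_DISTANCE 
bound and field names are my assumptions for illustration.

{code:java}
// Sketch only: keep the distance -> bytesRead mapping in an array indexed by
// distance, instead of hard-coded fields plus switch-case.
public class Statistics {
  public static class StatisticsData {
    // Assumed upper bound on network distance; larger values fold into the
    // last bucket.
    private static final int MAX_DISTANCE = 4;
    private final long[] bytesReadByDistance = new long[MAX_DISTANCE + 1];

    void incrementBytesReadByDistance(int distance, long newBytes) {
      bytesReadByDistance[Math.min(distance, MAX_DISTANCE)] += newBytes;
    }

    long getBytesReadByDistance(int distance) {
      return bytesReadByDistance[Math.min(distance, MAX_DISTANCE)];
    }
  }
}
{code}

With this, a caller that wants the bytes read for all distances can call 
getData() once and loop over the array, instead of calling 
getBytesReadByDistance() per distance on {{Statistics}}.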

For newly supported APIs, adding an entry to the map and one line of increment 
in the new method will do the trick (see the sketch below). From the point of 
view of the file system APIs, the public methods are not evolving rapidly. 
Another dimension would be needed for cross-DC analysis, but based on the 
current use case I don't think that dimension is strongly needed. One point is 
that all file systems of the same kind share the statistics data among 
threads, regardless of whether the HDFS clusters are remote or local.
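
A minimal sketch of the per-operation counters, assuming an enum map keyed by 
an operation type; the {{OpType}} constants and class name here are 
illustrative, not the actual patch:

{code:java}
import java.util.EnumMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative operation types; the real set would cover all DFSClient ops.
enum OpType { CREATE, DELETE, EXISTS, MKDIRS, RENAME, LIST_STATUS }

class OpStatistics {
  // One thread-safe counter per operation, pre-populated so increments need
  // no null checks.
  private final EnumMap<OpType, AtomicLong> counters =
      new EnumMap<>(OpType.class);

  OpStatistics() {
    for (OpType op : OpType.values()) {
      counters.put(op, new AtomicLong());
    }
  }

  void increment(OpType op) {
    counters.get(op).incrementAndGet();
  }

  long get(OpType op) {
    return counters.get(op).get();
  }
}
{code}

Supporting a new API would then only need a new enum constant plus a single 
increment call in the new method, e.g. {{increment(OpType.MKDIRS)}} in 
{{mkdirs()}}.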

[~jnp] also suggested making this feature optional/configurable. I found that 
hard, largely because each file system object has its own configuration, while 
they share the statistics data by FS class type (scheme). It would be 
confusing if one FS instance disables this feature and another enables it (see 
the illustration below). Is there any easy approach to handling this case?
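
To illustrate the conflict, a hypothetical example: the config key 
{{fs.statistics.detailed.enabled}} does not exist and the cluster names are 
placeholders, but the shared scheme-wide statistics behavior is real.

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SharedStatsExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical per-instance toggle, for illustration only.
    Configuration confA = new Configuration();
    confA.setBoolean("fs.statistics.detailed.enabled", true);
    Configuration confB = new Configuration();
    confB.setBoolean("fs.statistics.detailed.enabled", false);

    FileSystem fsA = FileSystem.get(URI.create("hdfs://clusterA"), confA);
    FileSystem fsB = FileSystem.get(URI.create("hdfs://clusterB"), confB);

    // Both instances contribute to the same Statistics object shared by the
    // "hdfs" scheme, so which instance's setting should win?
  }
}
{code}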

P.S. The v4 patch is to address the {{HAS_NEXT}}/{{LIST_STATUS}} case.

> add per-operation stats to FileSystem.Statistics
> ------------------------------------------------
>
>                 Key: HDFS-10175
>                 URL: https://issues.apache.org/jira/browse/HDFS-10175
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Ram Venkatesh
>            Assignee: Mingliang Liu
>         Attachments: HDFS-10175.000.patch, HDFS-10175.001.patch, 
> HDFS-10175.002.patch, HDFS-10175.003.patch, HDFS-10175.004.patch, 
> TestStatisticsOverhead.java
>
>
> Currently FileSystem.Statistics exposes the following statistics:
> BytesRead
> BytesWritten
> ReadOps
> LargeReadOps
> WriteOps
> These are in turn exposed as job counters by MapReduce and other frameworks. 
> There is logic within DfsClient to map operations to these counters, which 
> can be confusing; for instance, mkdirs counts as a writeOp.
> Proposed enhancement:
> Add a statistic for each DfsClient operation including create, append, 
> createSymlink, delete, exists, mkdirs, rename and expose them as new 
> properties on the Statistics object. The operation-specific counters can be 
> used for analyzing the load imposed by a particular job on HDFS. 
> For example, we can use them to identify jobs that end up creating a large 
> number of files.
> Once this information is available in the Statistics object, the app 
> frameworks like MapReduce can expose them as additional counters to be 
> aggregated and recorded as part of job summary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
