[
https://issues.apache.org/jira/browse/HADOOP-13453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834359#comment-15834359
]
Steve Loughran commented on HADOOP-13453:
-----------------------------------------
Hi, don't worry about asking questions; we'll do our best to get you
contributing code. It benefits all of us if you are adding code to Hadoop.
On the split between the low-level "increment named counter" calls and the more
elegant "event with internal counters" methods: the event ones are cleaner, as
they stop the rest of the code having to know exactly which counters/gauges to
use. Consider the elegant ones the best approach, and the direct invocations us
being lazy.
The S3aInstrumentation class also has a set of explicitly named counters, such
as "filesDeleted", as well as lots that are only listed in the arrays
{{GAUGES_TO_CREATE}} and {{COUNTERS_TO_CREATE}}. That's evolution over time; I
got bored of having to name and register lots of fields, and realised I could
do it from the arrays, at the cost of a hash lookup on every increment.
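To make the two styles concrete, here is a minimal sketch. It is not the real S3A code: the class name, the counter names in the array, and the accessor methods are all invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an instrumentation class; not the real S3A code.
class SketchInstrumentation {
    // Counter names here are invented for the example.
    static final String[] COUNTERS_TO_CREATE = {"files_created", "directories_created"};

    // Style 1: an explicitly named field that has to be declared by hand.
    private final AtomicLong filesDeleted = new AtomicLong();

    // Style 2: counters created from the array and found by name.
    private final Map<String, AtomicLong> counters = new HashMap<>();

    SketchInstrumentation() {
        for (String name : COUNTERS_TO_CREATE) {
            counters.put(name, new AtomicLong());
        }
    }

    // Event-style method: callers do not need to know which counter is used.
    void fileDeleted(int count) {
        filesDeleted.addAndGet(count);
    }

    // Low-level increment: one hash lookup per call; unknown names are ignored.
    void incrementCounter(String name, long delta) {
        AtomicLong counter = counters.get(name);
        if (counter != null) {
            counter.addAndGet(delta);
        }
    }

    long getFilesDeleted() {
        return filesDeleted.get();
    }

    long getCounterValue(String name) {
        AtomicLong counter = counters.get(name);
        return counter == null ? 0 : counter.get();
    }
}
```

The event method hides the choice of counter from the caller, while the named increment trades a field declaration for a map lookup.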
Outside the S3a class itself, I've tried to have separate inner classes do
the counting, with the results merged in at the end (example: the input and
output streams), with the inner classes using simple long values rather than
atomics. Why? It eliminates any delays during increments, and lets us override
the toString() values for input/output streams with dumps of the values (go on,
try it!). We can have many input/output streams per FS instance, so the risk of
contention for atomic int/long values is potentially quite high.
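A rough sketch of that per-stream pattern follows, assuming each stream is used by one thread at a time. The class, field, and method names are hypothetical and are not taken from the S3A source.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical filesystem-wide totals, shared by all streams of one FS instance.
class SketchFsStatistics {
    private final AtomicLong bytesRead = new AtomicLong();
    private final AtomicLong readOperations = new AtomicLong();

    SketchStreamStatistics newStreamStatistics() {
        return new SketchStreamStatistics(this);
    }

    // Called once per stream, so the atomics are touched rarely.
    void merge(SketchStreamStatistics stream) {
        bytesRead.addAndGet(stream.bytesRead);
        readOperations.addAndGet(stream.readOperations);
    }

    long getBytesRead() {
        return bytesRead.get();
    }

    long getReadOperations() {
        return readOperations.get();
    }
}

// Hypothetical per-stream statistics: plain longs, no atomics, no contention.
class SketchStreamStatistics {
    long bytesRead;
    long readOperations;
    private final SketchFsStatistics parent;
    private boolean merged;

    SketchStreamStatistics(SketchFsStatistics parent) {
        this.parent = parent;
    }

    // Record one read call of the given size.
    void readCompleted(long bytes) {
        readOperations++;
        bytesRead += bytes;
    }

    // Merge into the shared totals exactly once, when the stream is closed.
    void close() {
        if (!merged) {
            parent.merge(this);
            merged = true;
        }
    }

    // A stream's toString() can delegate here to dump its own values.
    @Override
    public String toString() {
        return "StreamStatistics{bytesRead=" + bytesRead
            + ", readOperations=" + readOperations + "}";
    }
}
```

The shared totals only change when a stream closes, so a long-lived stream's numbers are visible through its own toString() until then.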
I think for s3guard we could add a new inner class passed in to each s3guard
instance; it would export the various methods for events that s3guard could
raise, such as {{tableCreated()}} and {{tableDeleted()}}. These can directly
increment the atomic counters in the instrumentation, as we'd only have a 1:1
mapping between an S3aFS instance and an s3guard store instance.
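One possible shape for that proposal is sketched below. Only {{tableCreated()}} and {{tableDeleted()}} come from the suggestion above; the class names, the factory method, and the store class are assumptions made up for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical instrumentation owned by one filesystem instance.
class SketchS3AInstrumentation {
    private final AtomicLong tablesCreated = new AtomicLong();
    private final AtomicLong tablesDeleted = new AtomicLong();

    // Inner class handed to the s3guard store. It exposes events only, and
    // increments the outer atomic counters directly because one store maps
    // to one filesystem instance.
    final class S3GuardEvents {
        void tableCreated() {
            tablesCreated.incrementAndGet();
        }

        void tableDeleted() {
            tablesDeleted.incrementAndGet();
        }
    }

    S3GuardEvents newS3GuardEvents() {
        return new S3GuardEvents();
    }

    long getTablesCreated() {
        return tablesCreated.get();
    }

    long getTablesDeleted() {
        return tablesDeleted.get();
    }
}

// Hypothetical store: it raises events and never names a counter itself.
class SketchMetadataStore {
    private final SketchS3AInstrumentation.S3GuardEvents events;

    SketchMetadataStore(SketchS3AInstrumentation.S3GuardEvents events) {
        this.events = events;
    }

    void createTable() {
        // ... the real work of creating the table would go here ...
        events.tableCreated();
    }

    void deleteTable() {
        // ... the real work of deleting the table would go here ...
        events.tableDeleted();
    }
}
```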
Regarding access to the statistics: that's hooked up to
{{FileSystem.getStorageStatistics()}}, which is intended to provide the storage
stats for any FS; s3a and HDFS share common statistic names for the common
statistics. The latest versions of Tez do collect the statistics of jobs, and
so give you the aggregate statistics across your entire query. Until now, only
{{FileSystem.getStatistics()}} has been used, which returns a fixed set of
values (bytes read/written, etc). Spark still only collects those; it'd take
some migration to Hadoop 2.8+ to pick up the new data. Until then, it's
something we can use in tests.
> S3Guard: Instrument new functionality with Hadoop metrics.
> ----------------------------------------------------------
>
> Key: HADOOP-13453
> URL: https://issues.apache.org/jira/browse/HADOOP-13453
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Reporter: Chris Nauroth
> Assignee: Ai Deng
>
> Provide Hadoop metrics showing operational details of the S3Guard
> implementation.
> The metrics will be implemented in this ticket:
> ● S3GuardRechecksNthPercentileLatency (MutableQuantiles) Percentile time spent
> in rechecks attempting to achieve consistency. Repeated for multiple percentile
> values of N. This metric is an indicator of the additional latency cost of
> running S3A with S3Guard.
> ● S3GuardRechecksNumOps (MutableQuantiles) Number of times a consistency
> recheck was required while attempting to achieve consistency.
> ● S3GuardStoreNthPercentileLatency (MutableQuantiles) Percentile time spent in
> operations against the consistent store, including both write operations during
> file system mutations and read operations during file system consistency
> checks. Repeated for multiple percentile values of N. This metric is an
> indicator of latency to the consistent store implementation.
> ● S3GuardConsistencyStoreNumOps (MutableQuantiles) Number of operations
> against the consistent store, including both write operations during file
> system mutations and read operations during file system consistency checks.
> ● S3GuardConsistencyStoreFailures (MutableCounterLong) Number of failures
> during operations against the consistent store implementation.
> ● S3GuardConsistencyStoreTimeouts (MutableCounterLong) Number of timeouts
> during operations against the consistent store implementation.
> ● S3GuardInconsistencies (MutableCounterLong) Count of times S3Guard failed to
> achieve consistency, even after exhausting all rechecks. A high count may
> indicate unexpected out-of-band modification of the S3 bucket contents, such as
> by an external tool that does not make corresponding updates to the consistent
> store.