Hmm, I'm not sure what you propose to link it to Spark sinks, but
S3AInstrumentation.getMetricsSystem().allSources() for hadoop-aws and
MetricsPublisher for Iceberg are the "least bad" solutions I came up with.
Clearly dirty, but more efficient than re-instrumenting the whole stack
everywhere (pull vs push mode).

Do you mean I should wrap everything to read the thread-local every time
and maintain the registry in Spark's MetricsSystem?
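If it is the thread-local route, the shape I have in mind is roughly this.
A self-contained sketch: ThreadStatsSketch and onStreamClose are stand-ins
I made up for illustration, not Hadoop's real IOStatisticsContext API
(HADOOP-17461), which needs hadoop-common on the classpath:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the per-thread statistics idea behind IOStatisticsContext:
// each stream merges its counters into a thread-local map in close(), and
// a sink polls/snapshots that map for the worker thread.
public class ThreadStatsSketch {
    static final ThreadLocal<Map<String, Long>> CTX =
        ThreadLocal.withInitial(ConcurrentHashMap::new);

    // what an input stream would do in close(): merge its counters
    static void onStreamClose(long bytesRead) {
        CTX.get().merge("stream_read_bytes", bytesRead, Long::sum);
    }

    public static void main(String[] args) {
        onStreamClose(100);   // two streams opened and closed on this thread
        onStreamClose(50);
        // a per-thread sink would read the aggregate here
        System.out.println(CTX.get().get("stream_read_bytes")); // 150
    }
}
```

The open question stays the same: something still has to walk the
per-thread contexts and push the snapshots into a Spark sink.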

Another way to see it: open JMX when using hadoop-aws; those are the
graphs I want to get into Grafana at some point.
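For reference, the reflection hack boils down to this pattern. A
self-contained sketch: MetricsHolder and allSources() here are stand-ins
for S3AInstrumentation and its metrics system, which aren't public API
(hence the setAccessible):

```java
import java.lang.reflect.Method;

// Pull pattern: reach into a class that doesn't expose its metric
// sources publicly, then hand each source to a sink.
public class ReflectionPullSketch {
    static class MetricsHolder { // stand-in for S3AInstrumentation
        private String[] allSources() {
            return new String[]{"s3a:streams", "s3a:throttles"};
        }
    }

    public static void main(String[] args) throws Exception {
        MetricsHolder holder = new MetricsHolder();
        Method m = MetricsHolder.class.getDeclaredMethod("allSources");
        m.setAccessible(true); // private method, not public API
        for (String s : (String[]) m.invoke(holder)) {
            System.out.println(s); // a Spark Sink would poll each source here
        }
    }
}
```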

Romain Manni-Bucau
@rmannibucau <https://x.com/rmannibucau> | .NET Blog
<https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/> | Old
Blog <http://rmannibucau.wordpress.com> | Github
<https://github.com/rmannibucau> | LinkedIn
<https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
Javaccino founder (Java/.NET service - contact via linkedin)


On Thu, 12 Feb 2026 at 19:19, Steve Loughran <[email protected]>
wrote:

>
> ok, stream level.
>
> No, it's not the same.
>
> For those s3a input stream stats, you don't need to go into the s3a
> internals
> 1. every source of IOStats implements InputStreamStatistics, which is
> hadoop-common code
> 2. in close() s3a input streams update thread level IOStatisticsContext (
> https://issues.apache.org/jira/browse/HADOOP-17461 ... some stabilisation
> so use with Hadoop 3.4.0/Spark 4.0+)
>
> The thread stuff is so streams opened and closed in, say, the parquet
> reader, update stats just for that worker thread even though you never get
> near the stream instance itself.
>
> Regarding iceberg fileio stats, well, maybe someone could add it to the
> classes. Spark 4+ could think about collecting the stats for each task and
> aggregating, as that was the original goal. You get that aggregation
> indirectly on s3a with the s3a committers, similar through abfs, but really
> spark should just collect and report it itself.
>
>
> On Thu, 12 Feb 2026 at 17:03, Romain Manni-Bucau <[email protected]>
> wrote:
>
>> Hi Steve,
>>
>> Are you referring to org.apache.iceberg.io.FileIOMetricsContext and
>> org.apache.hadoop.fs.FileSystem.Statistics.StatisticsData? They miss most
>> of what I'm looking for (HTTP 429 throttling, to cite a single one).
>> software.amazon.awssdk.metrics helps a bit but is not sink-friendly.
>> Compared to hadoop-aws, combining the Iceberg-native metrics and the AWS
>> S3 client ones kind of compensates for the lack, but what I would love to
>> see is org.apache.hadoop.fs.s3a.S3AInstrumentation, and more particularly
>> org.apache.hadoop.fs.s3a.S3AInstrumentation.InputStreamStatistics#InputStreamStatistics
>> (I'm mainly reading in my use cases).
>>
>>
>>
>>
>> On Thu, 12 Feb 2026 at 15:50, Steve Loughran <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Thu, 12 Feb 2026 at 10:39, Romain Manni-Bucau <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Is it intended that S3FileIO doesn't wire [aws
>>>> sdk].ClientOverrideConfiguration.Builder#addMetricPublisher, so that,
>>>> compared to hadoop-aws, you can't retrieve metrics from Spark (or any
>>>> other engine) and send them to a collector in a centralized manner?
>>>> Is there another intended way?
>>>>
>>>
>>> There's already a PR up awaiting review by committers:
>>> https://github.com/apache/iceberg/pull/15122
>>>
>>>
>>>
>>>>
>>>> For plain hadoop-aws the workaround is to use (by reflection)
>>>> S3AInstrumentation.getMetricsSystem().allSources() and wire it to a
>>>> Spark sink.
>>>>
>>>
>>> The intended way to do it there is the IOStatistics API: it not only
>>> lets you at the s3a stats (Google Cloud collects stuff the same way),
>>> there's also explicit per-thread collection of stream read and write
>>> statistics...
>>>
>>> try setting
>>>
>>> fs.iostatistics.logging.level info
>>>
>>> to see what gets measured
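>>> In Spark, one way to set that is through the spark.hadoop.* passthrough
>>> (a sketch; the property name is the one above, the jar name is made up):
>>>
>>> ```shell
>>> # spark.hadoop.* confs are copied into the executors' Hadoop Configuration
>>> spark-submit \
>>>   --conf spark.hadoop.fs.iostatistics.logging.level=info \
>>>   your-job.jar
>>> ```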
>>>
>>>
>>>> To be clear, I do care about the bytes written/read, but more
>>>> importantly about the latency, number of requests, statuses, etc. The
>>>> stats exposed through FileSystem in Iceberg are < 10, whereas we should
>>>> get >> 100 stats (taking Hadoop as a reference).
>>>>
>>>
>>> AWS metrics are a very limited set:
>>>
>>> software.amazon.awssdk.core.metrics.CoreMetric
>>>
>>> The retry count is good here as it measures stuff beneath any
>>> application code. With the REST signer, it'd make sense to also collect
>>> signing time, as the RPC call to the signing endpoint would be included.
>>>
>>> -steve
>>>
>>
