hmm, I'm not sure what you propose to link it to Spark sinks, but S3AInstrumentation.getMetricsSystem().allSources() for hadoop-aws and MetricPublisher for Iceberg are the least-bad solutions I came up with. Clearly dirty, but more efficient than re-instrumenting the whole stack everywhere (pull vs push mode).
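For concreteness, the reflective pull mentioned above could look roughly like this. A minimal sketch, assuming the method names quoted in this thread (`getMetricsSystem()` and `allSources()` are hadoop-aws internals, not public API, so they may differ across Hadoop versions); it degrades to null when hadoop-aws isn't on the classpath:

```java
import java.lang.reflect.Method;

/**
 * Sketch of the "pull" workaround: reflectively reach
 * S3AInstrumentation.getMetricsSystem().allSources() so the sources can be
 * re-registered against an engine-side sink. Method names are taken from the
 * email thread; verify them against your Hadoop version before relying on this.
 */
public class S3AMetricsBridge {

    /** Returns the hadoop-aws metric sources, or null when hadoop-aws is absent. */
    public static Object pullS3ASources() {
        try {
            Class<?> instr = Class.forName("org.apache.hadoop.fs.s3a.S3AInstrumentation");
            Method getMetricsSystem = instr.getDeclaredMethod("getMetricsSystem");
            getMetricsSystem.setAccessible(true);      // the accessor is not public API
            Object metricsSystem = getMetricsSystem.invoke(null);
            Method allSources = metricsSystem.getClass().getMethod("allSources");
            allSources.setAccessible(true);
            return allSources.invoke(metricsSystem);   // iterate these and feed a Spark sink
        } catch (ReflectiveOperationException absentOrChanged) {
            return null;                               // hadoop-aws missing, or internals moved
        }
    }

    public static void main(String[] args) {
        System.out.println(pullS3ASources() == null
                ? "hadoop-aws not on classpath; nothing to bridge"
                : "sources pulled");
    }
}
```

The try/catch is the point of the sketch: since the whole thing leans on internals, a classpath or version mismatch should disable the bridge rather than fail the job.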
Do you mean I should wrap everything to read the thread local every time and maintain the registry in Spark's MetricsSystem? Another way to see it: open JMX when using hadoop-aws; those are the graphs I want to get into Grafana at some point.

Romain Manni-Bucau
@rmannibucau <https://x.com/rmannibucau> | .NET Blog <https://dotnetbirdie.github.io/> | Blog <https://rmannibucau.github.io/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/en-us/product/java-ee-8-high-performance-9781788473064>
Javaccino founder (Java/.NET service - contact via linkedin)

On Thu, 12 Feb 2026 at 19:19, Steve Loughran <[email protected]> wrote:

>
> ok, stream level.
>
> No, it's not the same.
>
> For those s3a input stream stats, you don't need to go into the s3a
> internals:
> 1. every source of IOStats implements InputStreamStatistics, which is
> hadoop-common code
> 2. in close(), s3a input streams update the thread-level IOStatisticsContext
> (https://issues.apache.org/jira/browse/HADOOP-17461 ... some stabilisation,
> so use with Hadoop 3.4.0/Spark 4.0+)
>
> The thread stuff is so streams opened and closed in, say, the parquet
> reader update stats just for that worker thread, even though you never get
> near the stream instance itself.
>
> Regarding iceberg fileio stats, well, maybe someone could add it to the
> classes. Spark 4+ could think about collecting the stats for each task and
> aggregating, as that was the original goal. You get that aggregation
> indirectly on s3a with the s3a committers, and similarly through abfs, but
> really spark should just collect and report it itself.
>
>
> On Thu, 12 Feb 2026 at 17:03, Romain Manni-Bucau <[email protected]>
> wrote:
>
>> Hi Steve,
>>
>> Are you referring to org.apache.iceberg.io.FileIOMetricsContext and
>> org.apache.hadoop.fs.FileSystem.Statistics.StatisticsData?
>> It misses most of what I'm looking for (429s, to cite a single one).
>> software.amazon.awssdk.metrics helps a bit but is not sink friendly.
>> Compared to hadoop-aws usage, combining the Iceberg-native and AWS S3 client
>> ones kind of compensates for the lack, but what I would love to see
>> is org.apache.hadoop.fs.s3a.S3AInstrumentation and more particularly
>> org.apache.hadoop.fs.s3a.S3AInstrumentation.InputStreamStatistics
>> (I'm mainly reading for my use cases).
>>
>>
>> On Thu, 12 Feb 2026 at 15:50, Steve Loughran <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Thu, 12 Feb 2026 at 10:39, Romain Manni-Bucau <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Is it intended that S3FileIO doesn't wire [aws
>>>> sdk].ClientOverrideConfiguration.Builder#addMetricPublisher, so basically,
>>>> compared to hadoop-aws, you can't retrieve metrics from Spark (or any other
>>>> engine) and send them to a collector in a centralized manner?
>>>> Is there another intended way?
>>>>
>>>
>>> There's already a PR up awaiting review by committers:
>>> https://github.com/apache/iceberg/pull/15122
>>>
>>>
>>>
>>>>
>>>> For plain hadoop-aws the workaround is to use (by reflection)
>>>> S3AInstrumentation.getMetricsSystem().allSources() and wire it to a
>>>> Spark sink.
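The PR above wires the AWS SDK v2 hook (ClientOverrideConfiguration.Builder#addMetricPublisher). The "sink friendly" complaint is that the SDK pushes per-request metric collections, while engine sinks pull on a reporting interval; a small buffering publisher bridges the two. A dependency-free model of that bridge (the real interface is software.amazon.awssdk.metrics.MetricPublisher with publish(MetricCollection); the Map stand-in here is an assumption made to keep the sketch self-contained):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

/**
 * Push-to-pull bridge: the SDK-facing publish() buffers per-request metrics,
 * and the engine's sink drains the buffer on its own schedule. In real code
 * this would implement software.amazon.awssdk.metrics.MetricPublisher and be
 * registered via addMetricPublisher on the client's override configuration.
 */
public class SinkPublisher {

    private final ConcurrentLinkedQueue<Map<String, Object>> buffered =
            new ConcurrentLinkedQueue<>();

    /** SDK side (push): called once per request with that request's metrics. */
    public void publish(Map<String, Object> requestMetrics) {
        buffered.add(Map.copyOf(requestMetrics));
    }

    /** Sink side (pull): drained by the engine sink at each reporting interval. */
    public List<Map<String, Object>> drain() {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> m; (m = buffered.poll()) != null; ) {
            out.add(m);
        }
        return out;
    }

    public static void main(String[] args) {
        SinkPublisher p = new SinkPublisher();
        p.publish(Map.of("ApiCallDuration", 12, "RetryCount", 1));
        System.out.println(p.drain().size()); // 1
    }
}
```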
>>>>
>>>
>>> The intended way to do it there is to use the IOStatistics API, which not
>>> only lets you at the s3a stats; Google Cloud collects stuff the same way,
>>> and there's explicit collection of things per thread for stream read and
>>> write...
>>>
>>> Try setting
>>>
>>> fs.iostatistics.logging.level info
>>>
>>> to see what gets measured.
>>>
>>>
>>>> To be clear, I do care about the bytes written/read but more importantly
>>>> about the latency, number of requests, statuses etc. The stats exposed
>>>> through FileSystem in Iceberg are < 10, whereas we should get >> 100 stats
>>>> (taking Hadoop as a ref).
>>>
>>> AWS metrics are a very limited set:
>>>
>>> software.amazon.awssdk.core.metrics.CoreMetric
>>>
>>> The retry count is good here as it measures stuff beneath any
>>> application code. With the REST signer, it'd make sense to also collect
>>> signing time, as the RPC call to the signing endpoint would be included.
>>>
>>> -steve
>>
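The per-thread collection Steve describes (HADOOP-17461's IOStatisticsContext) boils down to streams merging their counters, in close(), into the context of the closing thread, so a worker thread sees stats for streams it never held a reference to. A self-contained model of that pattern, not the Hadoop API itself (the real entry point is IOStatisticsContext.getCurrentIOStatisticsContext() in hadoop-common; counter names here are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Model of per-thread IO statistics aggregation: one counter map per thread,
 * updated by stream close() and snapshotted by whatever wraps the task. This
 * is the shape of IOStatisticsContext, not the Hadoop implementation.
 */
public class ThreadStatsModel {

    /** One mutable counter map per thread, like the current IOStatisticsContext. */
    private static final ThreadLocal<Map<String, Long>> CONTEXT =
            ThreadLocal.withInitial(ConcurrentHashMap::new);

    /** What a stream's close() does: merge its counters into the caller's context. */
    static void onStreamClose(Map<String, Long> streamCounters) {
        streamCounters.forEach((k, v) -> CONTEXT.get().merge(k, v, Long::sum));
    }

    /** What a task wrapper does afterwards: snapshot the thread's aggregate. */
    static Map<String, Long> snapshot() {
        return Map.copyOf(CONTEXT.get());
    }

    public static void main(String[] args) {
        // two streams opened and closed deep inside a reader, same worker thread
        onStreamClose(Map.of("stream_read_bytes", 4096L, "stream_read_operations", 2L));
        onStreamClose(Map.of("stream_read_bytes", 1024L));
        System.out.println(snapshot().get("stream_read_bytes")); // 5120
    }
}
```

This is also why the aggregation works inside, say, a Parquet reader: the reader closes the stream on the worker thread, so the merge lands in that thread's context without Spark ever touching the stream instance.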
