Hi Nori,

I strongly support this proposal. Adding a vendor-neutral observability
pathway via OpenTelemetry into iceberg-core is a significant improvement
over relying solely on LoggingMetricsReporter or catalog-specific
implementations like the RESTMetricsReporter .

I reviewed PR #16250 in detail and wanted to share a few observations:

*1. Dependency Isolation * The decision to declare opentelemetry-api as a
compileOnly dependency is an excellent design choice . By requiring the
host application (Spark, Flink, Trino, etc.) to provide the SDK at runtime,
the reporter has zero runtime footprint when OTel isn't on the classpath .
In that scenario, GlobalOpenTelemetry simply returns a no-op instance and
metric calls are silently dropped. This cleanly avoids the classloading
conflicts (such as NoClassDefFoundError or version mismatches in shaded
jars) that frequently plague large distributed systems with deep dependency
trees .

*2. Metric Cardinality Controls * I noticed that iceberg.snapshot.id is
attached as an attribute on every scan and commit metric. Since snapshot
IDs are monotonically increasing and unique per commit, this creates a new
time series for every single snapshot. This is an unbounded cardinality
dimension that can trigger state explosions in backends like Prometheus,
Datadog, or Grafana Cloud, potentially consuming a large portion of a
user's observability budget while degrading query performance.

I suggest one of the following mitigations:

   -

   Move snapshot.id to exemplars or trace span attributes rather than
   metric attributes, keeping it available for drill-down without inflating
   cardinality .
   -

   Make it opt-in via a catalog property (e.g.,
   otel-metrics-include-snapshot-id=false by default).
   -

   At a minimum, document the cardinality implications clearly and
   recommend that users leverage OpenTelemetry Collector processors (like the
   groupbyattrs or cardinality limiter) to enforce limits in production .

Bounded attributes like iceberg.table.name and iceberg.operation are
absolutely fine, as those have naturally limited domains .

*3. Combined Reporter Testing * The existing unit tests cover ScanReport,
CommitReport, null safety, accumulation, and no-op fallback well. One gap
I'd like to see addressed is a test that exercises
MetricsReporters.combine(otelReporter,
otherReporter) to verify that both reporters receive reports in tandem .

There are pre-existing edge cases in the Iceberg metrics framework for
example, during BaseTransaction initialization, custom reporters have
occasionally failed to receive a CommitReport, with the framework falling
back to LoggingMetricsReporter. While this PR doesn't introduce that issue,
adding a targeted integration test that verifies the OTel reporter receives
both ScanReport and CommitReport through the combine(..) path during
transactional commits would strengthen confidence and prevent silent
regressions as the framework evolves.

Overall, this is an outstanding addition to the project. The implementation
is clean, the no-op degradation is elegant, and the timing is right.
Looking forward to seeing this land!

Best regards,

Viquar Khan

https://www.linkedin.com/in/vaquar-khan-b695577/

On Fri, 22 May 2026 at 17:10, Talat Uyarer via dev <[email protected]>
wrote:

> I also support this PR. Iceberg has good metrics and I agree @Yufei Gu
> <[email protected]> catalog is not right place for that.
>
> On Fri, May 22, 2026 at 2:48 PM Ryan Blue <[email protected]> wrote:
>
>> I think it is fine to add `opentelemetry-api` as `compileOnly`. We now
>> have checks in place that should prevent future dependency leaks into our
>> runtime Jars.
>>
>> My only concern is making sure that this is usable. I'm not familiar with
>> how OpenTelemetry works, but if this is expected to be bundled into
>> applications that use it then we generally want to meet those assumptions.
>> For libraries with multiple back-ends (like logging, for example) we would
>> normally include the API so that we can load classes that use it, but not
>> the specific back-end (log4j, logback, etc.). And when that API is very
>> commonly distributed (like slf4j) we usually don't include it. We just want
>> to avoid surprise failures for end users by meeting their expectations. (As
>> long as it doesn't require bundling the whole world)
>>
>> Ryan
>>
>> On Thu, May 21, 2026 at 10:22 AM Yufei Gu <[email protected]> wrote:
>>
>>> Hi Noritaka,
>>>
>>> Thanks for writing this up. I think this is a good direction overall.
>>> Using OpenTelemetry as a vendor neutral integration layer makes sense here.
>>> It gives Iceberg clients a standard way to integrate with modern
>>> observability platforms without adding per backend integrations into the
>>> project. Moreover, even for REST catalogs, the catalog itself is usually
>>> not the right place to serve as a full metrics monitoring platform. In
>>> practice, metrics still need to flow into systems like Prometheus,
>>> CloudWatch, Datadog, or Grafana Cloud for storage, dashboards, alerting,
>>> and correlation. The main extra value a REST catalog could provide is
>>> enrichment with catalog specific metadata, but that feels like a relatively
>>> minor benefit compared to having a standard observability integration path
>>> overall.
>>>
>>> Yufei
>>>
>>>
>>> On Wed, May 20, 2026 at 6:39 PM Noritaka Sekiyama via dev <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'd like to propose adding an OpenTelemetry-based MetricsReporter to
>>>> iceberg-core that exports ScanReport and CommitReport to any 
>>>> OTLP-compatible
>>>> backend.
>>>>
>>>> # Background
>>>> Iceberg ships three built-in MetricsReporter implementations today:
>>>> LoggingMetricsReporter, InMemoryMetricsReporter (Spark-internal), and
>>>> RESTMetricsReporter (REST catalog only).
>>>> None of them give users an out-of-the-box way to ship scan/commit
>>>> metrics to an external observability platform.
>>>> The gap applies to Spark users on non-REST catalogs and to all
>>>> non-Spark engines (Trino, Flink, etc.).
>>>>
>>>> # Motivation
>>>> OpenTelemetry is the vendor-neutral CNCF standard for telemetry,
>>>> supported by every major observability backend (Prometheus, CloudWatch,
>>>> Datadog, Grafana Cloud, etc.).
>>>> A single OTLP-based MetricsReporter in Iceberg lets users reach all of
>>>> these without per-vendor integrations in the project.
>>>> This is complementary to #14360, which adds OTel support to HTTPClient
>>>> at the REST-catalog HTTP layer; this proposal covers the Iceberg-level
>>>> ScanReport / CommitReport layer.
>>>>
>>>> # Proposal
>>>> Issue: https://github.com/apache/iceberg/issues/16169
>>>> PR:    https://github.com/apache/iceberg/pull/16250
>>>>
>>>> The reporter follows the same SDK-ownership philosophy as #14360 - the
>>>> host application (Spark/Flink/Trino/...) registers an OpenTelemetrySdk via
>>>> GlobalOpenTelemetry, and the reporter just looks up a Meter from it.
>>>> The reporter has zero Iceberg-specific catalog properties; everything
>>>> else is owned by the host.
>>>>
>>>> The PR has been validated end-to-end against two unrelated OTLP
>>>> backends (Databricks Zerobus and Amazon CloudWatch) - full procedures and
>>>> queries are linked from the PR.
>>>>
>>>> # On dependencies
>>>> Given the current sensitivity around new runtime dependencies in 1.11,
>>>> the PR adds only opentelemetry-api to iceberg-core as compileOnly.
>>>> The OpenTelemetry SDK and OTLP exporters are not added to the runtime 
>>>> classpath
>>>> - they come from the host application.
>>>> opentelemetry-sdk / -sdk-testing are testImplementation only.
>>>>
>>>> # Questions for the community
>>>>
>>>> Q1. Any objection to taking the opentelemetry-api compileOnly
>>>> dependency in iceberg-core?
>>>> Q2. Module placement: iceberg-core (current PR), or a
>>>> separate iceberg-opentelemetry module?
>>>>
>>>> Thanks,
>>>> Noritaka Sekiyama, Databricks
>>>>
>>>

Reply via email to