Hi Nori,

Thanks for the details. I agree that it is reasonable to start with metrics (using gauges, etc.). We can consider enhancing this with spans in a future step, as that likely falls more on the query engine side than on Iceberg directly.

The documentation you added looks great and is exactly what I had in mind. Thanks!

Regards,
JB

On Wed, May 27, 2026 at 3:02 AM Noritaka Sekiyama <[email protected]> wrote:
>
> Hi Grant, and all,
>
> Thanks for sharing the data point - cardinality from per-table attributes is
> exactly the kind of real-world failure mode the design should account for,
> and your experience is fair.
>
> I pushed commit 5d867e49d to #16250 that addresses this by making the
> attribute set configurable and giving users more control over cardinality. A
> new catalog property iceberg.otel.metrics.attributes accepts a
> comma-separated allowlist of attribute short names (table-name, schema-id,
> operation). Attributes whose short names are not listed are omitted from
> emitted metric points. The default attribute set is table-name and operation;
> schema-id is opt-in. Workloads with thousands of tables can flip table-name
> off and keep operation-level aggregates when it is preferred.
>
> For users who want to keep iceberg.table.name but only for a subset of
> tables, I've also filed #16573 to propose a framework-level table-name filter
> that would apply uniformly across all MetricsReporter implementations -
> complementary to the per-reporter attribute pruning above. This would also
> address your concern.
>
> On the span-based reporter suggestion: I took some time to think through
> whether it makes sense to layer that into this PR or as a sibling reporter
> alongside OtelMetricsReporter. I'd like to defer it, mainly because emitting
> OpenTelemetry spans through the MetricsReporter callback feels semantically
> off - MetricsReporter fires after the operation has finished, so the reporter
> would have to synthesize spans retroactively from the report's duration
> rather than open and close them at the real operation boundaries, and the
> class name MetricsReporter emitting traces is itself a friction point.
> The natural home for span-based observability is probably an Iceberg-side
> instrumentation hook in the scan planner / commit code paths that opens spans
> at the real boundaries, which is a larger design discussion that I'd want to
> handle as a separate Issue / PR rather than bolting onto this one.
>
> For #16250 specifically, my preference is to keep it as a metrics-only
> reporter with the control above.
>
> Thanks,
> Nori
>
> On Tue, May 26, 2026 at 1:14 AM Grant Nicholas
> <[email protected]> wrote:
>>
>> +1 with OTEL implementation of MetricsReporter, but have you considered a
>> span-based implementation instead of/in addition to a metrics-based
>> implementation?
>>
>> High cardinality metrics should be avoided and (schema_name, table_name)
>> attributes can be high cardinality depending on your workload. Spans do not
>> have problems with high cardinality.
>>
>> For context, we built a metrics-based MetricsReporter, ran into high
>> cardinality cost issues with thousands of tables, then switched to a
>> span-based MetricsReporter.
>>
>> On Mon, May 25, 2026 at 2:08 AM Noritaka Sekiyama via dev
>> <[email protected]> wrote:
>>>
>>> Hi JB, and all,
>>>
>>> Thanks for the suggestion. Pushed efc48d429 which adds an
>>> OtelMetricsReporter section to docs/docs/metrics-reporting.md. It documents
>>> the host's responsibility for packaging the OpenTelemetry API, SDK, and a
>>> metric exporter (Gradle plus a spark-submit --packages example), the
>>> programmatic SDK registration path, exporter-wiring examples for the
>>> OpenTelemetry Collector, Prometheus (pull and push), and Amazon CloudWatch
>>> via the sigv4auth Collector extension, plus the emitted metric names and
>>> attribute set.
>>>
>>> Verified end-to-end against the Prometheus pull pattern from the docs (host
>>> SDK with PrometheusHttpServer + OtelMetricsReporter reporting synthetic
>>> ScanReport/CommitReport, all 12 iceberg.* series visible on /metrics with
>>> the documented attribute set); each Collector YAML in the docs was
>>> otelcol-contrib validate-checked.
>>>
>>> Looking forward to your PR review.
>>>
>>> Thanks,
>>> Nori
>>>
>>> On Mon, May 25, 2026 at 3:00 PM Jean-Baptiste Onofré <[email protected]>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I think this is a great proposal.
>>>>
>>>> I would suggest documenting how to package the exporter, as I believe it
>>>> is up to the user to package the specific OpenTelemetry exporter they need.
>>>>
>>>> I will take a look at the PR.
>>>>
>>>> Regards,
>>>> JB
>>>>
>>>> On Thu, May 21, 2026 at 3:39 AM Noritaka Sekiyama via dev
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to propose adding an OpenTelemetry-based MetricsReporter to
>>>>> iceberg-core that exports ScanReport and CommitReport to any
>>>>> OTLP-compatible backend.
>>>>>
>>>>> # Background
>>>>> Iceberg ships three built-in MetricsReporter implementations today:
>>>>> LoggingMetricsReporter, InMemoryMetricsReporter (Spark-internal), and
>>>>> RESTMetricsReporter (REST catalog only).
>>>>> None of them give users an out-of-the-box way to ship scan/commit metrics
>>>>> to an external observability platform.
>>>>> The gap applies to Spark users on non-REST catalogs and to all non-Spark
>>>>> engines (Trino, Flink, etc.).
>>>>>
>>>>> # Motivation
>>>>> OpenTelemetry is the vendor-neutral CNCF standard for telemetry,
>>>>> supported by every major observability backend (Prometheus, CloudWatch,
>>>>> Datadog, Grafana Cloud, etc.).
>>>>> A single OTLP-based MetricsReporter in Iceberg lets users reach all of
>>>>> these without per-vendor integrations in the project.
>>>>> This is complementary to #14360, which adds OTel support to HTTPClient at
>>>>> the REST-catalog HTTP layer; this proposal covers the Iceberg-level
>>>>> ScanReport / CommitReport layer.
>>>>>
>>>>> # Proposal
>>>>> Issue: https://github.com/apache/iceberg/issues/16169
>>>>> PR: https://github.com/apache/iceberg/pull/16250
>>>>>
>>>>> The reporter follows the same SDK-ownership philosophy as #14360 - the
>>>>> host application (Spark/Flink/Trino/...) registers an OpenTelemetrySdk
>>>>> via GlobalOpenTelemetry, and the reporter just looks up a Meter from it.
>>>>> The reporter has zero Iceberg-specific catalog properties; everything
>>>>> else is owned by the host.
>>>>>
>>>>> The PR has been validated end-to-end against two unrelated OTLP backends
>>>>> (Databricks Zerobus and Amazon CloudWatch) - full procedures and queries
>>>>> are linked from the PR.
>>>>>
>>>>> # On dependencies
>>>>> Given the current sensitivity around new runtime dependencies in 1.11,
>>>>> the PR adds only opentelemetry-api to iceberg-core as compileOnly.
>>>>> The OpenTelemetry SDK and OTLP exporters are not added to the runtime
>>>>> classpath - they come from the host application.
>>>>> opentelemetry-sdk / -sdk-testing are testImplementation only.
>>>>>
>>>>> # Questions for the community
>>>>>
>>>>> Q1. Any objection to taking the opentelemetry-api compileOnly dependency
>>>>> in iceberg-core?
>>>>> Q2. Module placement: iceberg-core (current PR), or a separate
>>>>> iceberg-opentelemetry module?
>>>>>
>>>>> Thanks,
>>>>> Noritaka Sekiyama, Databricks
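
For readers following the cardinality discussion, the attribute allowlist described in Nori's May 27 message (the iceberg.otel.metrics.attributes property with short names table-name, schema-id, and operation) can be pictured with the rough sketch below. It is not the code from #16250: the class AttributeAllowlist and its parse and prune helpers are hypothetical names, attributes are keyed by their short names for simplicity, and falling back to the default set when the property is unset or blank is an assumption. It uses only the Java standard library, not the OpenTelemetry API.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of allowlist-based attribute pruning; not the PR's code.
final class AttributeAllowlist {
  // Default stated in the thread: table-name and operation; schema-id is opt-in.
  static final Set<String> DEFAULT = Set.of("table-name", "operation");

  // Parses a comma-separated list of attribute short names.
  // Assumption: a null or blank property value falls back to the default set.
  static Set<String> parse(String propertyValue) {
    if (propertyValue == null || propertyValue.isBlank()) {
      return DEFAULT;
    }
    Set<String> allowed = new LinkedHashSet<>();
    for (String name : propertyValue.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        allowed.add(trimmed);
      }
    }
    return allowed;
  }

  // Keeps only the attributes whose short name is in the allowlist.
  static Map<String, String> prune(Map<String, String> attributes, Set<String> allowed) {
    Map<String, String> kept = new LinkedHashMap<>();
    attributes.forEach((name, value) -> {
      if (allowed.contains(name)) {
        kept.put(name, value);
      }
    });
    return kept;
  }

  public static void main(String[] args) {
    Map<String, String> all = new LinkedHashMap<>();
    all.put("table-name", "db.events");
    all.put("schema-id", "3");
    all.put("operation", "append");

    // table-name switched off: only operation-level aggregates remain.
    System.out.println(prune(all, parse("operation"))); // {operation=append}
    // Property unset: default set (table-name and operation).
    System.out.println(prune(all, parse(null))); // {table-name=db.events, operation=append}
  }
}
```

Dropping table-name from the allowlist collapses the per-table series into one series per operation, which is the cardinality reduction Grant's message asks for.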
