Hi Nori,

Thanks for the details. I agree that it is reasonable to start with
metrics (using gauges, etc.). We can consider enhancing this with
spans in a future step, as that likely falls more on the query engine
side than on Iceberg directly.

The documentation you added looks great and is exactly what I had in
mind. Thanks!

Regards,
JB

On Wed, May 27, 2026 at 3:02 AM Noritaka Sekiyama
<[email protected]> wrote:
>
> Hi Grant, and all,
>
> Thanks for sharing the data point — cardinality from per-table attributes is 
> exactly the kind of real-world failure mode the design should account for, 
> and your experience is fair.
>
> I pushed commit 5d867e49d to #16250 that addresses this by making the 
> attribute set configurable and giving users more control over cardinality. A 
> new catalog property iceberg.otel.metrics.attributes accepts a 
> comma-separated allowlist of attribute short names (table-name, schema-id, 
> operation). Attributes whose short names are not listed are omitted from 
> emitted metric points. The default attribute set is table-name and operation; 
> schema-id is opt-in. Workloads with thousands of tables can drop table-name 
> and keep operation-level aggregates when that is preferred.
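
For illustration, the allowlist behaviour described above could look roughly like the sketch below. The class and method names (AttributeAllowlist, parse, prune) are hypothetical and are not taken from #16250; only the short names table-name, schema-id, and operation come from the message. It assumes attributes can be modelled as a simple string map.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not the implementation in #16250.
final class AttributeAllowlist {
  private final Set<String> allowed;

  private AttributeAllowlist(Set<String> allowed) {
    this.allowed = allowed;
  }

  // Parses a comma-separated property value such as "table-name,operation".
  // Blank entries and surrounding whitespace are ignored.
  static AttributeAllowlist parse(String propertyValue) {
    Set<String> names = new LinkedHashSet<>();
    for (String part : propertyValue.split(",")) {
      String name = part.trim();
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return new AttributeAllowlist(names);
  }

  // Returns a copy of the candidate attributes that keeps only the
  // entries whose short name is on the allowlist.
  Map<String, String> prune(Map<String, String> candidates) {
    Map<String, String> kept = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : candidates.entrySet()) {
      if (allowed.contains(entry.getKey())) {
        kept.put(entry.getKey(), entry.getValue());
      }
    }
    return kept;
  }
}
```

With this sketch, AttributeAllowlist.parse("operation").prune(attrs) would keep only the operation entry, which corresponds to dropping table-name for workloads with many tables.
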
>
> For users who want to keep iceberg.table.name but only for a subset of 
> tables, I've also filed #16573 to propose a framework-level table-name filter 
> that would apply uniformly across all MetricsReporter implementations — 
> complementary to the per-reporter attribute pruning above. This would also 
> address your concern.
>
> On the span-based reporter suggestion: I took some time to think through 
> whether it makes sense to layer that into this PR or add it as a sibling 
> reporter alongside OtelMetricsReporter. I'd like to defer it, mainly because 
> emitting OpenTelemetry spans through the MetricsReporter callback feels 
> semantically off. MetricsReporter fires after the operation has finished, so 
> the reporter would have to synthesize spans retroactively from the report's 
> duration rather than open and close them at the real operation boundaries. A 
> class named MetricsReporter emitting traces is also a friction point in 
> itself. The natural home for span-based observability is probably an 
> Iceberg-side instrumentation hook in the scan planner / commit code paths that 
> opens spans at the real boundaries. That is a larger design discussion, which 
> I'd want to handle as a separate Issue / PR rather than bolting it onto this 
> one.
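
To make the "retroactive span" point concrete, here is a minimal sketch of what synthesizing span bounds after the fact would involve. RetroactiveSpan and fromReport are hypothetical names, and the sketch uses no OpenTelemetry or Iceberg types. It assumes the callback knows only the time it was invoked and the reported duration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration only; not part of #16250.
final class RetroactiveSpan {
  final String name;
  final Instant start;
  final Instant end;

  private RetroactiveSpan(String name, Instant start, Instant end) {
    this.name = name;
    this.start = start;
    this.end = end;
  }

  // The reporter runs after the operation has finished, so the start time
  // is back-computed as (callback time - reported duration). The end time
  // is the callback time, which may be later than the real operation end.
  static RetroactiveSpan fromReport(String name, Instant reportedAt, Duration duration) {
    return new RetroactiveSpan(name, reportedAt.minus(duration), reportedAt);
  }
}
```

Both bounds are estimates here, which is the semantic mismatch described above: a hook at the real operation boundaries would record them directly.
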
>
> For #16250 specifically, my preference is to keep it as a metrics-only 
> reporter with the control above.
>
> Thanks,
> Nori
>
> On Tue, May 26, 2026 at 1:14 AM Grant Nicholas 
> <[email protected]> wrote:
>>
>> +1 to an OTel implementation of MetricsReporter, but have you considered a 
>> span-based implementation instead of, or in addition to, a metrics-based 
>> implementation?
>>
>> High cardinality metrics should be avoided and (schema_name, table_name) 
>> attributes can be high cardinality depending on your workload. Spans do not 
>> have problems with high cardinality.
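
As a rough worked example of the cardinality concern, the number of time series is bounded by the number of metric names multiplied by the distinct values of each attribute. The sketch below is only arithmetic; SeriesEstimate is a hypothetical name, and the table and operation counts in the usage note are made-up inputs, not measurements.

```java
// Hypothetical illustration of a cardinality upper bound.
final class SeriesEstimate {
  // Upper bound on time series: metric names times the product of the
  // number of distinct values for each attribute. Throws ArithmeticException
  // on long overflow instead of silently wrapping.
  static long upperBound(long metricNames, long... distinctValuesPerAttribute) {
    long total = metricNames;
    for (long distinct : distinctValuesPerAttribute) {
      total = Math.multiplyExact(total, distinct);
    }
    return total;
  }
}
```

For instance, assuming 12 metric names, 5000 tables, and 4 operations, upperBound(12, 5000, 4) gives 240000 series, while dropping the table attribute gives upperBound(12, 4), which is 48.
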
>>
>> For context, we built a metrics-based MetricsReporter, ran into high 
>> cardinality cost issues with thousands of tables, then switched to a 
>> span-based MetricsReporter.
>>
>> On Mon, May 25, 2026 at 2:08 AM Noritaka Sekiyama via dev 
>> <[email protected]> wrote:
>>>
>>> Hi JB, and all,
>>>
>>> Thanks for the suggestion. Pushed efc48d429 which adds an 
>>> OtelMetricsReporter section to docs/docs/metrics-reporting.md. It documents 
>>> the host's responsibility for packaging the OpenTelemetry API, SDK, and a 
>>> metric exporter (Gradle plus a spark-submit --packages example), the 
>>> programmatic SDK registration path, exporter-wiring examples for the 
>>> OpenTelemetry Collector, Prometheus (pull and push), and Amazon CloudWatch 
>>> via the sigv4auth Collector extension, plus the emitted metric names and 
>>> attribute set.
>>>
>>> Verified end-to-end against the Prometheus pull pattern from the docs (host 
>>> SDK with PrometheusHttpServer + OtelMetricsReporter reporting synthetic 
>>> ScanReport/CommitReport, all 12 iceberg.* series visible on /metrics with 
>>> the documented attribute set); each Collector YAML in the docs was checked 
>>> with otelcol-contrib validate.
>>>
>>> Looking forward to your PR review.
>>>
>>> Thanks,
>>> Nori
>>>
>>> On Mon, May 25, 2026 at 3:00 PM Jean-Baptiste Onofré <[email protected]> 
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I think this is a great proposal.
>>>>
>>>> I would suggest documenting how to package the exporter, as I believe it 
>>>> is up to the user to package the specific OpenTelemetry exporter they need.
>>>>
>>>> I will take a look at the PR.
>>>>
>>>> Regards,
>>>> JB
>>>>
>>>> On Thu, May 21, 2026 at 3:39 AM Noritaka Sekiyama via dev 
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'd like to propose adding an OpenTelemetry-based MetricsReporter to 
>>>>> iceberg-core that exports ScanReport and CommitReport to any 
>>>>> OTLP-compatible backend.
>>>>>
>>>>> # Background
>>>>> Iceberg ships three built-in MetricsReporter implementations today: 
>>>>> LoggingMetricsReporter, InMemoryMetricsReporter (Spark-internal), and 
>>>>> RESTMetricsReporter (REST catalog only).
>>>>> None of them give users an out-of-the-box way to ship scan/commit metrics 
>>>>> to an external observability platform.
>>>>> The gap applies to Spark users on non-REST catalogs and to all non-Spark 
>>>>> engines (Trino, Flink, etc.).
>>>>>
>>>>> # Motivation
>>>>> OpenTelemetry is the vendor-neutral CNCF standard for telemetry, 
>>>>> supported by every major observability backend (Prometheus, CloudWatch, 
>>>>> Datadog, Grafana Cloud, etc.).
>>>>> A single OTLP-based MetricsReporter in Iceberg lets users reach all of 
>>>>> these without per-vendor integrations in the project.
>>>>> This is complementary to #14360, which adds OTel support to HTTPClient at 
>>>>> the REST-catalog HTTP layer; this proposal covers the Iceberg-level 
>>>>> ScanReport / CommitReport layer.
>>>>>
>>>>> # Proposal
>>>>> Issue: https://github.com/apache/iceberg/issues/16169
>>>>> PR:    https://github.com/apache/iceberg/pull/16250
>>>>>
>>>>> The reporter follows the same SDK-ownership philosophy as #14360 - the 
>>>>> host application (Spark/Flink/Trino/...) registers an OpenTelemetrySdk 
>>>>> via GlobalOpenTelemetry, and the reporter just looks up a Meter from it.
>>>>> The reporter has zero Iceberg-specific catalog properties; everything 
>>>>> else is owned by the host.
>>>>>
>>>>> The PR has been validated end-to-end against two unrelated OTLP backends 
>>>>> (Databricks Zerobus and Amazon CloudWatch) - full procedures and queries 
>>>>> are linked from the PR.
>>>>>
>>>>> # On dependencies
>>>>> Given the current sensitivity around new runtime dependencies in 1.11, 
>>>>> the PR adds only opentelemetry-api to iceberg-core as compileOnly.
>>>>> The OpenTelemetry SDK and OTLP exporters are not added to the runtime 
>>>>> classpath - they come from the host application.
>>>>> opentelemetry-sdk / -sdk-testing are testImplementation only.
>>>>>
>>>>> # Questions for the community
>>>>>
>>>>> Q1. Any objection to taking the opentelemetry-api compileOnly dependency 
>>>>> in iceberg-core?
>>>>> Q2. Module placement: iceberg-core (current PR), or a separate 
>>>>> iceberg-opentelemetry module?
>>>>>
>>>>> Thanks,
>>>>> Noritaka Sekiyama, Databricks
