moomindani commented on PR #16250:
URL: https://github.com/apache/iceberg/pull/16250#issuecomment-4406599079
## End-to-end validation results — CloudWatch and Zerobus
This is a follow-up to the PR description's **Validation** section. That
section summarized end-to-end runs against Databricks Zerobus and AWS
CloudWatch performed against the original property-driven design. To make sure
nothing regressed in the redesign, I re-ran both validations against the
**current** code on this PR (commit `efa838e7f`, post-review fixes) and am
posting the full procedure, log excerpts, and backend queries here so reviewers
can verify operational equivalence on their own.
### Result summary
| Backend | Outcome | Match against synthetic injection |
|---|---|---|
| **Databricks Zerobus** (OTLP/gRPC + OAuth Bearer → UC Delta table) | ✅ | 57 rows in `users.noritaka_sekiyama.iceberg_otel_metrics`; 12 `iceberg.*` metrics × 4 export cycles + 9 OTel self-monitoring rows; values match exactly (`data_files=7`, `records.added=12345`, `planning.duration=123ms`, `commit.duration=231ms`); `iceberg.table.name` / `iceberg.snapshot.id` round-trip via Variant. |
| **AWS CloudWatch** (OTLP/gRPC → `otelcol-contrib` SigV4 → CloudWatch PromQL) | ✅ | All 12 `iceberg.*` series visible in `us-west-2` under `@resource.service.name=iceberg-aws-validation-optionb`; PromQL `last_over_time({...}[1h])` returns exact values (`data_files=7`, `records.added=12345`, `file_size.bytes=4096000`, `attempts=1`). |
The reporter's wire output is equivalent to the prior runs against the
property-driven design: same metric names, same attribute keys, same units.
The only on-the-wire difference is the `service.name` resource attribute,
which the host application now chooses.
### What changed under the current design (recap)
- Catalog config is a single property:
`metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter`. No
`otel.*` keys.
- Host application owns the SDK lifecycle — typically
`OpenTelemetrySdk.builder()...buildAndRegisterGlobal()` or the OpenTelemetry
Java agent.
- Reporter calls `GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")`
in `initialize(...)` and reports through it. If no SDK is registered,
OpenTelemetry returns the no-op implementation and metric calls are silently
dropped — standard OTel contract.
- The validator used in this re-run lives entirely outside the iceberg repo
at `/tmp/iceberg-otel-validation/` (standalone Gradle JavaExec project
depending on the locally-built
`iceberg-{core,api,common,bundled-guava}-efa838e.dirty.jar`). Nothing was
committed to the iceberg repo.
<details>
<summary><b>Full Databricks Zerobus report</b> (architecture, queries,
gotchas)</summary>
# Iceberg `OtelMetricsReporter` (Option B) × Databricks Zerobus OTLP: Re-validation
**Date**: 2026-05-08
**Branch / commit**: `moomindani/otel-metrics-reporter` @ `efa838e7f`
(Apache Iceberg PR [#16250](https://github.com/apache/iceberg/pull/16250))
**Result**: Success. **57 metric rows** landed in the destination Delta
table within an ~8 second window. All metric values, histogram statistics, and
Iceberg attributes (`iceberg.table.name`, `iceberg.snapshot.id`,
`iceberg.schema.id`, `iceberg.operation`) match the values injected by the
validator.
---
## 1. Subject
Re-validate the **Option B** design of `OtelMetricsReporter` against the
same Databricks Zerobus Ingest OTLP endpoint that was used for the previous
Option A validation (2026-04-30). In Option B the host application owns the
OpenTelemetry SDK lifecycle — it builds an `OpenTelemetrySdk`, registers it via
`GlobalOpenTelemetry.set(...)`, and the reporter just looks up the global
`Meter` in `initialize(...)`. The reporter exposes **zero catalog properties**.
This validation confirms that the post-review reporter still works
end-to-end with a real, externally-hosted OTLP receiver — i.e. that the Option
B refactor did not break the wire-level behavior.
---
## 2. Architecture (Option B)
```
+--------------------+        +------------------------+        +--------------+        +-------------+
| Iceberg core       |  uses  | OtelMetricsReporter    |  uses  | Global OTel  |  OTLP  | Zerobus     |
| (ScanReport,       +------->+ (no SDK ownership;     +------->+ SDK (host    +------->+ Direct      |
|  CommitReport)     |        |  Meter via Global)     |        |  registers)  |  gRPC  | Write API   |
+--------------------+        +------------------------+        +--------------+        +------+------+
                                                                                               |
                                                                                               v
                                                                               +-------------------------------+
                                                                               | Unity Catalog Delta table     |
                                                                               | users.noritaka_sekiyama.      |
                                                                               | iceberg_otel_metrics          |
                                                                               +-------------------------------+
```
No collector, no proxy. The host's OTLP gRPC exporter speaks directly to the
Zerobus public endpoint.
---
## 3. Setup
### 3.1 Catalog property (the only Iceberg-side knob)
```properties
metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter
```
That's it. No endpoint, no headers, no protocol — Option B has zero
reporter-specific properties. The reporter does:
```java
@Override
public void initialize(Map<String, String> properties) {
  this.meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
  createInstruments();
}
```
If no host SDK has been registered, `GlobalOpenTelemetry.get()` returns the
no-op implementation and metrics are silently dropped — exactly per the
OpenTelemetry contract.
### 3.2 Host-side SDK registration (the meaningful part of the validator)
```java
Resource resource =
    Resource.getDefault().toBuilder()
        .put(AttributeKey.stringKey("service.name"), "iceberg-zerobus-validation-optionb")
        .build();
OtlpGrpcMetricExporter exporter =
    OtlpGrpcMetricExporter.builder()
        .setEndpoint("https://5099015744649857.zerobus.ap-northeast-1.cloud.databricks.com:443")
        .addHeader("Authorization", "Bearer " + token)
        // <redacted>
        .addHeader("x-databricks-zerobus-table-name", "users.noritaka_sekiyama.iceberg_otel_metrics")
        .setTimeout(Duration.ofSeconds(15))
        .build();
SdkMeterProvider meterProvider =
    SdkMeterProvider.builder()
        .setResource(resource)
        .registerMetricReader(
            PeriodicMetricReader.builder(exporter)
                .setInterval(Duration.ofSeconds(2))
                .build())
        .build();
OpenTelemetrySdk sdk =
    OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
GlobalOpenTelemetry.resetForTest(); // make registration idempotent across runs
GlobalOpenTelemetry.set(sdk);
// Now Iceberg can be wired up, exactly as if a catalog had loaded the reporter.
OtelMetricsReporter reporter = new OtelMetricsReporter();
reporter.initialize(Collections.emptyMap());
```
Full source:
`/tmp/iceberg-otel-validation/src/main/java/ZerobusValidator.java`
(uncommitted; lives outside the iceberg repo).
### 3.3 Synthetic reports
Both `ScanReport` and `CommitReport` are constructed with
`ImmutableScanReport.builder() / ImmutableCommitReport.builder()` and the
corresponding `*MetricsResult` builders, using `TimerResult.of` and
`CounterResult.of(MetricsContext.Unit.COUNT|BYTES, ...)` — same patterns as
`core/.../TestOtelMetricsReporter.java`.
| Report | snapshotId | Notable injected values |
|---|---|---|
| ScanReport | 3001 | `dataFiles=7, deleteFiles=1, manifestsScanned=3, manifestsSkipped=0, fileSize=4_096_000B, planningDuration=123 ms` |
| CommitReport | 3002 | `attempts=1, addedDataFiles=4, removedDataFiles=0, addedRecords=12345, addedFileSize=2_048_000B, totalDuration=231 ms` |
---
## 4. Token / SP recipe
The same Service Principal `iceberg-otel-zerobus-sp`
(`applicationId=b3845821-2880-477d-a5ff-e4198a45afa5`, SCIM id
`70839113910562`) was reused. A **new client secret** was issued for this run:
```bash
/opt/homebrew/bin/databricks service-principal-secrets-proxy create 70839113910562 --profile DEFAULT
# -> client_id, client_secret -- captured into env vars only, never written to disk
```
The OAuth bearer token used to authenticate the OTLP request requires both
`resource` and `authorization_details` claims, scoped to the destination
table's UC privileges:
```bash
AUTH_DETAILS='[
  {"type":"unity_catalog_privileges","privileges":["USE CATALOG"],"object_type":"CATALOG","object_full_path":"users"},
  {"type":"unity_catalog_privileges","privileges":["USE SCHEMA"],"object_type":"SCHEMA","object_full_path":"users.noritaka_sekiyama"},
  {"type":"unity_catalog_privileges","privileges":["SELECT","MODIFY"],"object_type":"TABLE","object_full_path":"users.noritaka_sekiyama.iceberg_otel_metrics"}
]'
ZEROBUS_TOKEN=$(curl -s -X POST -u "$DBX_CLIENT_ID:$DBX_CLIENT_SECRET" \
  -d "grant_type=client_credentials" \
  -d "scope=all-apis" \
  -d "resource=api://databricks/workspaces/5099015744649857/zerobusDirectWriteApi" \
  --data-urlencode "authorization_details=$AUTH_DETAILS" \
  "https://e2-demo-tokyo.cloud.databricks.com/oidc/v1/token" \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["access_token"])')
# Token TTL: ~1 hour. <redacted> in this report.
```
---
## 5. Run
```bash
export JAVA_HOME=$(/usr/libexec/java_home -v 17)
export ZEROBUS_TOKEN=<redacted>
cd /tmp/iceberg-otel-validation && ./gradlew run -PmainClass=ZerobusValidator
```
Console output (excerpted):
```
[validator] Option B Zerobus validation starting (host owns SDK).
[validator] GlobalOpenTelemetry registered with service.name=iceberg-zerobus-validation-optionb
[validator] OTLP endpoint=https://5099015744649857.zerobus.ap-northeast-1.cloud.databricks.com:443 \
            table=users.noritaka_sekiyama.iceberg_otel_metrics
[main] INFO org.apache.iceberg.metrics.OtelMetricsReporter - OtelMetricsReporter initialized. SDK lifecycle is owned by the host application (via GlobalOpenTelemetry).
[validator] OtelMetricsReporter initialized via Global SDK lookup.
[validator] ScanReport reported (snapshotId=3001, dataFiles=7).
[validator] CommitReport reported (snapshotId=3002, records=12345).
[validator] Sleeping 8s to allow periodic flush...
[validator] meterProvider closed.
[validator] Done.
BUILD SUCCESSFUL in 11s
```
(One `Failed to export metrics … Canceled` line appears on the final
shutdown — it's the in-flight HTTP call that the SDK aborts when
`meterProvider.close()` is invoked. The four prior periodic flushes had already
succeeded.)
---
## 6. Delta table query results
### 6.1 Row count and time window
```sql
SELECT COUNT(*) AS rows, MIN(time), MAX(time)
FROM users.noritaka_sekiyama.iceberg_otel_metrics
WHERE service_name = 'iceberg-zerobus-validation-optionb';
```
| rows | min_time | max_time |
|---|---|---|
| 57 | 2026-05-08T12:51:01.340Z | 2026-05-08T12:51:07.336Z |
### 6.2 Per-metric breakdown
```sql
SELECT name, metric_type, COUNT(*) AS rows
FROM users.noritaka_sekiyama.iceberg_otel_metrics
WHERE service_name = 'iceberg-zerobus-validation-optionb'
GROUP BY name, metric_type
ORDER BY name;
```
| name | metric_type | rows |
|---|---|---|
| `iceberg.commit.attempts` | sum | 4 |
| `iceberg.commit.data_files.added` | sum | 4 |
| `iceberg.commit.data_files.removed` | sum | 4 |
| `iceberg.commit.duration` | histogram | 4 |
| `iceberg.commit.file_size.added_bytes` | sum | 4 |
| `iceberg.commit.records.added` | sum | 4 |
| `iceberg.scan.data_manifests.scanned` | sum | 4 |
| `iceberg.scan.data_manifests.skipped` | sum | 4 |
| `iceberg.scan.file_size.bytes` | sum | 4 |
| `iceberg.scan.planning.duration` | histogram | 4 |
| `iceberg.scan.result.data_files` | sum | 4 |
| `iceberg.scan.result.delete_files` | sum | 4 |
| `otel.sdk.metric_reader.collection.duration` | histogram | 3 |
| `otlp.exporter.exported` | sum | 3 |
| `otlp.exporter.seen` | sum | 3 |
12 Iceberg metrics × 4 export cycles = 48 rows. The remaining 9 rows are
OTel SDK self-monitoring metrics (always present in OTel ≥ 1.61; three metrics
with 3 rows each in this run), so 48 + 9 = 57. The counts line up with four
`PeriodicMetricReader(2 s)` exports during the 8-second sleep; the flush
triggered by `meterProvider.close()` is the export that was canceled on shutdown.
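The same arithmetic can be rechecked mechanically from the per-metric breakdown. The following is a minimal Java sketch, not part of the validator; the class and method names (`ExpectedRowCount`, `observedBreakdown`, `rowsWithPrefix`) are hypothetical, and the counts are copied by hand from the 6.2 result table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Recomputes the row totals from the per-metric breakdown in section 6.2. */
final class ExpectedRowCount {

  /** Row counts per metric name, copied from the section 6.2 result table. */
  static Map<String, Integer> observedBreakdown() {
    Map<String, Integer> rows = new LinkedHashMap<>();
    rows.put("iceberg.commit.attempts", 4);
    rows.put("iceberg.commit.data_files.added", 4);
    rows.put("iceberg.commit.data_files.removed", 4);
    rows.put("iceberg.commit.duration", 4);
    rows.put("iceberg.commit.file_size.added_bytes", 4);
    rows.put("iceberg.commit.records.added", 4);
    rows.put("iceberg.scan.data_manifests.scanned", 4);
    rows.put("iceberg.scan.data_manifests.skipped", 4);
    rows.put("iceberg.scan.file_size.bytes", 4);
    rows.put("iceberg.scan.planning.duration", 4);
    rows.put("iceberg.scan.result.data_files", 4);
    rows.put("iceberg.scan.result.delete_files", 4);
    rows.put("otel.sdk.metric_reader.collection.duration", 3);
    rows.put("otlp.exporter.exported", 3);
    rows.put("otlp.exporter.seen", 3);
    return rows;
  }

  /** Sums the rows of every metric whose name starts with the given prefix. */
  static int rowsWithPrefix(Map<String, Integer> rowsByMetric, String prefix) {
    return rowsByMetric.entrySet().stream()
        .filter(e -> e.getKey().startsWith(prefix))
        .mapToInt(Map.Entry::getValue)
        .sum();
  }

  public static void main(String[] args) {
    Map<String, Integer> rows = observedBreakdown();
    int icebergRows = rowsWithPrefix(rows, "iceberg.");
    int totalRows = rowsWithPrefix(rows, ""); // empty prefix matches every metric
    System.out.println("iceberg rows:         " + icebergRows);
    System.out.println("self-monitoring rows: " + (totalRows - icebergRows));
    System.out.println("total rows:           " + totalRows);
  }
}
```

Splitting on the `iceberg.` prefix mirrors the `name LIKE 'iceberg.%'` filter used elsewhere in this report.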
### 6.3 Per-row verification (Iceberg metrics only)
```sql
SELECT name,
       variant_get(sum.attributes, '$["iceberg.table.name"]', 'STRING') AS table_name,
       variant_get(sum.attributes, '$["iceberg.snapshot.id"]', 'BIGINT') AS snapshot_id,
       sum.value AS sum_value
FROM users.noritaka_sekiyama.iceberg_otel_metrics
WHERE variant_get(sum.attributes, '$["iceberg.table.name"]', 'STRING') = 'zerobus_validation.test_table'
  AND time = (SELECT MAX(time) FROM users.noritaka_sekiyama.iceberg_otel_metrics
              WHERE service_name = 'iceberg-zerobus-validation-optionb')
ORDER BY name;
```
| name | table_name | snapshot_id | sum_value |
|---|---|---|---|
| `iceberg.commit.attempts` | zerobus_validation.test_table | **3002** | 1.0 |
| `iceberg.commit.data_files.added` | zerobus_validation.test_table | 3002 | 4.0 |
| `iceberg.commit.data_files.removed` | zerobus_validation.test_table | 3002 | 0.0 |
| `iceberg.commit.file_size.added_bytes` | zerobus_validation.test_table | 3002 | 2 048 000.0 |
| `iceberg.commit.records.added` | zerobus_validation.test_table | 3002 | **12 345.0** |
| `iceberg.scan.data_manifests.scanned` | zerobus_validation.test_table | **3001** | 3.0 |
| `iceberg.scan.data_manifests.skipped` | zerobus_validation.test_table | 3001 | 0.0 |
| `iceberg.scan.file_size.bytes` | zerobus_validation.test_table | 3001 | 4 096 000.0 |
| `iceberg.scan.result.data_files` | zerobus_validation.test_table | 3001 | **7.0** |
| `iceberg.scan.result.delete_files` | zerobus_validation.test_table | 3001 | 1.0 |
Every value matches what was injected: `dataFiles=7`, `deleteFiles=1`,
`manifestsScanned=3`, `manifestsSkipped=0`, `fileSize=4 096 000B`,
`attempts=1`, `addedDataFiles=4`, `removedDataFiles=0`, `addedRecords=12 345`,
`addedFileSize=2 048 000B`. Snapshot ids `3001` (scan) and `3002` (commit) flow
through to `iceberg.snapshot.id` correctly. Same for
`iceberg.table.name=zerobus_validation.test_table`.
### 6.4 Histogram example query
```sql
SELECT
  variant_get(h.histogram.attributes, '$["iceberg.table.name"]', 'STRING') AS table_name,
  h.name AS metric,
  h.histogram.count AS sample_count,
  h.histogram.sum AS total_ms,
  h.histogram.min AS min_ms,
  h.histogram.max AS max_ms,
  ROUND(h.histogram.sum / NULLIF(h.histogram.count, 0), 2) AS mean_ms,
  h.time
FROM users.noritaka_sekiyama.iceberg_otel_metrics h
WHERE h.service_name = 'iceberg-zerobus-validation-optionb'
  AND h.metric_type = 'histogram'
  AND h.name LIKE 'iceberg.%'
ORDER BY h.time DESC, h.name
LIMIT 8;
```
| table_name | metric | sample_count | total_ms | min_ms | max_ms | mean_ms |
|---|---|---|---|---|---|---|
| zerobus_validation.test_table | `iceberg.commit.duration` | 1 | 231.0 | 231.0 | 231.0 | 231.0 |
| zerobus_validation.test_table | `iceberg.scan.planning.duration` | 1 | 123.0 | 123.0 | 123.0 | 123.0 |
| zerobus_validation.test_table | `iceberg.commit.duration` | 1 | 231.0 | 231.0 | 231.0 | 231.0 |
| zerobus_validation.test_table | `iceberg.scan.planning.duration` | 1 | 123.0 | 123.0 | 123.0 | 123.0 |
| ... | (4 cycles × 2 histograms) | | | | | |
Histogram `count`, `sum`, `min`, `max` round-trip correctly — `total_ms`
matches the injected `Duration.ofMillis(231)` / `Duration.ofMillis(123)`.
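Each histogram row above carries a single sample, which is why `sum`, `min`, `max`, and the mean coincide. A minimal Java sketch of that arithmetic follows, including the zero-count guard that `NULLIF(h.histogram.count, 0)` provides in the query. The class and method names are hypothetical and are not part of the reporter or the validator.

```java
import java.util.DoubleSummaryStatistics;
import java.util.OptionalDouble;
import java.util.stream.DoubleStream;

/** Mirrors the per-row histogram statistics selected by the query above. */
final class HistogramRowStats {

  /** Aggregates raw duration samples into count/sum/min/max, like one histogram data point. */
  static DoubleSummaryStatistics summarize(double... samplesMs) {
    return DoubleStream.of(samplesMs).summaryStatistics();
  }

  /** Mean rounded to 2 decimals; empty when count is 0 (the NULLIF guard in the SQL). */
  static OptionalDouble meanMs(long count, double sumMs) {
    if (count == 0) {
      return OptionalDouble.empty();
    }
    return OptionalDouble.of(Math.round(sumMs / count * 100.0) / 100.0);
  }

  public static void main(String[] args) {
    // One injected commit duration of 231 ms, as in the synthetic CommitReport.
    DoubleSummaryStatistics commit = summarize(231.0);
    System.out.println("count=" + commit.getCount()
        + " sum=" + commit.getSum()
        + " min=" + commit.getMin()
        + " max=" + commit.getMax()
        + " mean=" + meanMs(commit.getCount(), commit.getSum()));
  }
}
```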
---
## 7. Key differences vs Option A validation (2026-04-30)
| Aspect | Option A (previous) | **Option B (this run)** |
|---|---|---|
| Who owns the OpenTelemetry SDK? | The reporter built its own `OpenTelemetrySdk` from catalog properties. | The **host** builds `OpenTelemetrySdk` and registers via `GlobalOpenTelemetry.set(...)`. |
| Catalog properties on the reporter | ~7 (`otel.endpoint`, `otel.protocol`, `otel.headers`, `otel.service-name`, `otel.export-interval`, …). | **Zero**. Host configures everything. |
| Iceberg-side configuration | `metrics-reporter-impl=…OtelMetricsReporter` + several `otel.*` properties. | `metrics-reporter-impl=…OtelMetricsReporter` only. |
| Smoke test in iceberg test module | `OtelEndpointSmokeTest` gated on env var (`OTEL_SMOKE_ENABLED`). | None. Validator lives entirely outside the iceberg repo (`/tmp/iceberg-otel-validation/`); nothing committed to iceberg. |
| Reporter dependency surface | Compile dependency on OTel API + reflective load of OTLP exporter classes. | Compile dependency on OTel **API only**. The OTLP exporter, headers, retry, etc. are entirely the host's concern. |
| Failure mode if no SDK is registered | Reporter fails to initialize. | `GlobalOpenTelemetry.get()` returns no-op and metrics are silently dropped, matching the standard OTel contract. |
Behaviorally on the wire, the two designs are indistinguishable: rows landed
in Delta with the same metric names, the same Iceberg attribute keys, and the
same numeric values.
---
## 8. Lessons learned / gotchas
| Topic | Detail |
|---|---|
| `GlobalOpenTelemetry.set` is one-shot per JVM | Calling `set(...)` twice throws `IllegalStateException` unless `resetForTest()` is invoked first. The validator calls `resetForTest()` before `set(...)` to make in-process re-runs idempotent. Production hosts should set the global exactly once at startup. |
| `Canceled` log on shutdown | When `meterProvider.close()` is called while an OTLP request is still in flight, the OkHttp client surfaces `java.io.IOException: Canceled`. This is benign: the four prior periodic exports had already succeeded (4 rows × 12 metrics = 48). |
| Validator dependency surface | Outside-the-repo validator pulls in OTLP gRPC exporter + iceberg-core/api/common/bundled-guava jars. iceberg-core itself is unchanged; the test module only ships `opentelemetry-api`, `opentelemetry-sdk`, `opentelemetry-sdk-testing`. That's deliberate (Option B doesn't want to make the OTLP exporter a transitive dep of iceberg-core). |
| OTel SDK self-monitoring metrics in 1.61 | The Delta table receives up to 3 extra rows per export cycle (`otel.sdk.metric_reader.collection.duration`, `otlp.exporter.seen`, `otlp.exporter.exported`); this run produced 9 such rows across 4 cycles. Filter by `service_name` plus `name LIKE 'iceberg.%'` to keep dashboards focused. |
| `iceberg.table.name` attribute | Stored as a Variant key with dots; `variant_get(<col>, '$["iceberg.table.name"]', 'STRING')` is needed instead of dotted accessor syntax. Same as Option A. |
| Token TTL | 1 hour. For long-running services use the OTel collector's `oauth2clientauthextension` instead of a static bearer in `addHeader(...)`. |
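
For hosts that export directly without a Collector, the token-TTL gotcha could alternatively be handled in-process with a small refreshing cache. The sketch below is an assumption-laden illustration, not something this validation exercised: `RefreshingToken` and its `fetcher` are hypothetical names, and the fetcher stands in for the client-credentials call from section 4. To the best of my knowledge recent OTel Java exporter builders accept a header supplier (`setHeaders(Supplier<Map<String, String>>)`), which is where `get()` would be called; please verify that against the SDK version in use.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Caches a bearer token and re-fetches it shortly before its TTL elapses. */
final class RefreshingToken {
  private final Supplier<String> fetcher; // hypothetical: performs the OAuth token request
  private final Duration ttl;
  private final Duration refreshMargin;
  private final Supplier<Instant> clock;  // injectable so the refresh logic can be tested
  private String token;
  private Instant refreshAt;

  RefreshingToken(
      Supplier<String> fetcher, Duration ttl, Duration refreshMargin, Supplier<Instant> clock) {
    this.fetcher = fetcher;
    this.ttl = ttl;
    this.refreshMargin = refreshMargin;
    this.clock = clock;
  }

  /** Returns the cached token, fetching a new one once the refresh deadline has passed. */
  synchronized String get() {
    Instant now = clock.get();
    if (token == null || !now.isBefore(refreshAt)) {
      token = fetcher.get();
      refreshAt = now.plus(ttl).minus(refreshMargin);
    }
    return token;
  }

  public static void main(String[] args) {
    // Dummy fetcher; a real one would call the token endpoint.
    RefreshingToken cache =
        new RefreshingToken(
            () -> "dummy-token", Duration.ofHours(1), Duration.ofMinutes(5), Instant::now);
    System.out.println("Authorization: Bearer " + cache.get());
  }
}
```

With a 1-hour TTL and a 5-minute margin, the token is re-fetched on the first `get()` at or after the 55-minute mark.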
---
## 9. Resources used
| Kind | Name / FQN | State |
|---|---|---|
| Delta table | `users.noritaka_sekiyama.iceberg_otel_metrics` | retained (rows from Option A and Option B coexist; filter by `service_name`) |
| Service Principal | `iceberg-otel-zerobus-sp` (`applicationId=b3845821-2880-477d-a5ff-e4198a45afa5`) | retained |
| New OAuth secret | issued for this run | retained on the SP; revoke when no longer needed |
| Validator project | `/tmp/iceberg-otel-validation/` (Gradle, JavaExec) | local only; not committed to iceberg |
| Iceberg local jars | `iceberg-{api,core,common,bundled-guava}-efa838e.dirty.jar` | built via `./gradlew :iceberg-core:jar :iceberg-api:jar :iceberg-common:jar :iceberg-bundled-guava:jar` |
---
**TL;DR**: Option B works end-to-end against Zerobus. The reporter's
`initialize(emptyMap())` plus a host-side `GlobalOpenTelemetry.set(sdk)` is
sufficient to stream every Iceberg scan and commit metric (values, histograms,
and table/snapshot attributes) straight into a Unity Catalog Delta table. 57
rows landed, and 100% of the values match what was injected.
</details>
<details>
<summary><b>Full AWS CloudWatch report</b> (architecture, PromQL,
gotchas)</summary>
# Iceberg `OtelMetricsReporter` (Option B) × AWS CloudWatch OTLP: Re-validation
**Date**: 2026-05-08
**Branch / commit**: `moomindani/otel-metrics-reporter` @ `efa838e7f`
**PR**: apache/iceberg #16250
**Result**: Success. All 12 `iceberg.*` metrics arrived in CloudWatch
through a local OpenTelemetry Collector (SigV4-signed). Both metric values and
attributes match the synthetic ScanReport / CommitReport injected by the runner.
---
## 1. Subject
Re-validate the `OtelMetricsReporter` against AWS CloudWatch OTLP (Public
Preview) after refactoring the reporter to **Option B** — i.e. the host
application owns the `OpenTelemetrySdk` lifecycle and registers it via
`GlobalOpenTelemetry.set(...)`, and the reporter exposes **zero catalog
properties** and simply calls
`GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")`.
Option A (reporter creates its own SDK from catalog properties) had already
been validated end-to-end against CloudWatch on 2026-04-30 — see
`~/Documents/workspace/iceberg-otel-aws-validation.md`. This document confirms
the **current** (Option B) reporter still produces identical wire output and is
operationally equivalent on the AWS side; the only thing that changed is who
owns the SDK.
---
## 2. Architecture
Identical to the Option A validation — the reporter only ever speaks plain
OTLP/gRPC to a local Collector; the Collector adds SigV4 and forwards to
CloudWatch.
```
+-----------------------+        OTLP/gRPC        +---------------------+   OTLP/HTTP+SigV4   +------------+
| Iceberg (Java)        | ----------------------> | otelcol-contrib     | ------------------> | CloudWatch |
| OtelMetricsReporter   |     localhost:4317      | (native binary,     |                     | (PromQL,   |
| (Option B: host SDK)  |                         |  sigv4authextension)|                     |  Query     |
+-----------------------+                         +---------------------+                     |  Studio)   |
                                                             |                                +------------+
                                                             | (debug exporter for tracing)
                                                             v
                                                          stdout
```
The only thing that changed between Option A and Option B is the left-hand
box: in A, the reporter built its own `OpenTelemetrySdk` from catalog
properties; in B, the host application is responsible for the SDK and the
reporter is a thin "adapter" that finds the meter via `GlobalOpenTelemetry`.
---
## 3. Setup
### 3.1 Catalog configuration (Option B)
```properties
# In Spark / Flink / Trino catalog config
metrics-reporter-impl = org.apache.iceberg.metrics.OtelMetricsReporter
```
That is the **entire** Iceberg-side configuration. No endpoint, no headers,
no service name, no exporter type, no batch interval — **zero properties**.
Everything is owned by the host process's `OpenTelemetrySdk`.
### 3.2 Host-side SDK registration (excerpt of validator)
The host process is expected to do something like this once at startup. The
validator does it in `main()` to mimic a host's bootstrap:
```java
Resource resource =
    Resource.getDefault().toBuilder()
        .put(AttributeKey.stringKey("service.name"), "iceberg-aws-validation-optionb")
        .build();
OtlpGrpcMetricExporter exporter =
    OtlpGrpcMetricExporter.builder()
        .setEndpoint("http://localhost:4317")
        .setTimeout(Duration.ofSeconds(10))
        .build();
SdkMeterProvider meterProvider =
    SdkMeterProvider.builder()
        .setResource(resource)
        .registerMetricReader(
            PeriodicMetricReader.builder(exporter)
                .setInterval(Duration.ofSeconds(2))
                .build())
        .build();
OpenTelemetrySdk sdk =
    OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
GlobalOpenTelemetry.resetForTest(); // makes the registration idempotent
GlobalOpenTelemetry.set(sdk);
```
In production a host would normally register the SDK via the official
OpenTelemetry Java agent, `AutoConfiguredOpenTelemetrySdk`, or
`buildAndRegisterGlobal()`. The reporter does not care which path is used; it
only reads `GlobalOpenTelemetry.get()`.
### 3.3 Iceberg-side wiring
```java
OtelMetricsReporter reporter = new OtelMetricsReporter(); // no-arg ctor
reporter.initialize(Collections.emptyMap());              // zero properties
reporter.report(scanReport);
reporter.report(commitReport);
```
Synthetic reports were the same shape as `TestOtelMetricsReporter` but with
the values described in the task spec:
- `ScanReport`: snapshotId=2001, schemaId=1, projectedFieldIds=[1,2],
projectedFieldNames=["id","data"], resultDataFiles=7, resultDeleteFiles=1,
scannedDataManifests=3, skippedDataManifests=0, totalFileSizeInBytes=4_096_000,
totalPlanningDuration=123ms, tableName="aws_validation.test_table"
- `CommitReport`: snapshotId=2002, sequenceNumber=2, operation="append",
attempts=1, addedDataFiles=4, removedDataFiles=0, addedRecords=12345,
addedFilesSizeInBytes=2_048_000, totalDuration=231ms,
tableName="aws_validation.test_table"
### 3.4 Where the validator lives
- Build runner: `/tmp/iceberg-otel-validation/build.gradle`
- Validator:
`/tmp/iceberg-otel-validation/src/main/java/AwsCloudWatchValidator.java`
- PromQL probe: `/tmp/iceberg-otel-validation/promql_query.py`
- Run: `cd /tmp/iceberg-otel-validation &&
JAVA_HOME=$(/usr/libexec/java_home -v 17)
/Users/noritaka.sekiyama/Documents/workspace/iceberg/gradlew run`
The validator depends on the locally built Iceberg jars
(`iceberg-core-efa838e.dirty.jar`, `iceberg-api-efa838e.dirty.jar`,
`iceberg-bundled-guava-efa838e.dirty.jar`) plus the public OTel SDK + OTLP
exporter from Maven Central. **Nothing was committed to the Iceberg repo.**
### 3.5 Collector
Reused the binary and config from the Option A run:
- Binary:
`~/Documents/workspace/iceberg-otel-aws-validation/otelcol-contrib` (v0.151.0,
darwin_arm64, upstream `otelcol-contrib`).
- Config:
`~/Documents/workspace/iceberg-otel-aws-validation/otel-config.yaml` (already
SigV4-configured for `monitoring` / us-west-2). No edits required — Collector
has no concept of "Option A vs Option B"; it just receives OTLP and signs the
egress.
- Launch:
```bash
eval "$(aws configure export-credentials --profile 332745928618_databricks-sandbox-admin --format env)"
export AWS_REGION=us-west-2
cd ~/Documents/workspace/iceberg-otel-aws-validation
./otelcol-contrib --config otel-config.yaml > collector.log 2>&1 &
```
Verified `4317` and `4318` were listening before running the validator.
---
## 4. Runner output
```
[validator] Option B validation starting (host owns SDK).
[validator] GlobalOpenTelemetry registered with
service.name=iceberg-aws-validation-optionb
[validator] OtelMetricsReporter initialized via Global SDK lookup.
[main] INFO org.apache.iceberg.metrics.OtelMetricsReporter - OtelMetricsReporter initialized. SDK lifecycle is owned by the host application (via GlobalOpenTelemetry).
[validator] ScanReport reported (snapshotId=2001, dataFiles=7).
[validator] CommitReport reported (snapshotId=2002, records=12345).
[validator] Sleeping 8s to allow periodic flush...
[validator] meterProvider closed.
[validator] Done.
BUILD SUCCESSFUL
```
The reporter logs its initialization explicitly stating it does **not** own
the SDK — exactly the contract Option B promises.
---
## 5. Collector receipt (debug exporter)
Aggregate counts during the 8-second window of the run:
```
2026-05-08T21:49:33.387+0900 info Metrics ... resource metrics: 5, metrics: 72, data points: 72
```
Resource attributes attached to every export (verifies the host-side
resource flowed through unchanged):
```
Resource attributes:
-> service.name: Str(iceberg-aws-validation-optionb)
-> telemetry.sdk.language: Str(java)
-> telemetry.sdk.name: Str(opentelemetry)
-> telemetry.sdk.version: Str(1.61.0)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope org.apache.iceberg
```
`InstrumentationScope = org.apache.iceberg` confirms the reporter used the
meter name baked into `OtelMetricsReporter.INSTRUMENTATION_NAME`.
Per-metric verification — `iceberg.scan.result.data_files`:
```
-> Name: iceberg.scan.result.data_files
-> Description: Number of data files included in scan result
-> Unit:
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
-> iceberg.schema.id: Int(1)
-> iceberg.snapshot.id: Int(2001)
-> iceberg.table.name: Str(aws_validation.test_table)
StartTimestamp: 2026-05-08 12:49:23.892955 +0000 UTC
Timestamp: 2026-05-08 12:49:25.865293 +0000 UTC
Value: 7
```
`iceberg.commit.records.added`:
```
-> Name: iceberg.commit.records.added
-> Description: Number of records added by commit
-> Unit:
-> DataType: Sum
-> IsMonotonic: true
-> AggregationTemporality: Cumulative
NumberDataPoints #0
Data point attributes:
-> iceberg.operation: Str(append)
-> iceberg.snapshot.id: Int(2002)
-> iceberg.table.name: Str(aws_validation.test_table)
StartTimestamp: 2026-05-08 12:49:23.894813 +0000 UTC
Timestamp: 2026-05-08 12:49:25.865293 +0000 UTC
Value: 12345
```
All 12 `iceberg.*` metrics show up in the log with the expected attribute
keys (`iceberg.snapshot.id`, `iceberg.table.name`, `iceberg.schema.id`,
`iceberg.operation`) and values matching the synthetic reports.
---
## 6. CloudWatch verification (PromQL)
Queries were issued against `https://monitoring.us-west-2.amazonaws.com`
with SigV4 signing. The runner script is
`/tmp/iceberg-otel-validation/promql_query.py`.
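For readers who want to reproduce the probe without the Python script, the request URL for an instant query can be assembled as in the following minimal Java sketch. It only covers URL construction: SigV4 signing is deliberately not shown, and the class name `PromqlInstantQueryUrl` is hypothetical. The `/api/v1/query` path and the endpoint host are the ones quoted in this report.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Builds the unsigned URL for a PromQL instant query. */
final class PromqlInstantQueryUrl {

  /** Percent-encodes the PromQL text as a single "query" parameter. */
  static URI build(String endpoint, String promql) {
    String encoded = URLEncoder.encode(promql, StandardCharsets.UTF_8);
    return URI.create(endpoint + "/api/v1/query?query=" + encoded);
  }

  public static void main(String[] args) {
    URI uri =
        build(
            "https://monitoring.us-west-2.amazonaws.com",
            "last_over_time({__name__=\"iceberg.scan.result.data_files\"}[1h])");
    // The request must still be SigV4-signed before it is sent.
    System.out.println(uri);
  }
}
```

Whatever signs the request needs to sign this fully encoded URL, query string included, rather than only the path.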
### 6.1 Confirm the 12 series exist
```
GET /api/v1/label/__name__/values
```
Response (filtered to `iceberg.*`, sorted):
```
iceberg.commit.attempts
iceberg.commit.data_files.added
iceberg.commit.data_files.removed
iceberg.commit.duration
iceberg.commit.file_size.added_bytes
iceberg.commit.records.added
iceberg.scan.data_manifests.scanned
iceberg.scan.data_manifests.skipped
iceberg.scan.file_size.bytes
iceberg.scan.planning.duration
iceberg.scan.result.data_files
iceberg.scan.result.delete_files
```
All 12 are present. (CloudWatch retains `iceberg.*` series across runs, so
the Option A run from 2026-04-30 also contributes here — but the per-series
data below filters by `@resource.service.name=iceberg-aws-validation-optionb`
so we only see Option B samples.)
### 6.2 Per-metric value verification
PromQL (instant query, wrapped in `last_over_time(...[1h])` because the run
was a one-shot rather than continuous):
```promql
last_over_time({__name__="iceberg.scan.result.data_files"}[1h])
```
Response (filtered to the Option B series via `@resource.service.name`):
```json
{
"metric": {
"__name__": "iceberg.scan.result.data_files",
"__type__": "Sum",
"__monotonicity__": "true",
"__temporality__": "cumulative",
"@resource.service.name": "iceberg-aws-validation-optionb",
"@resource.telemetry.sdk.language": "java",
"@resource.telemetry.sdk.name": "opentelemetry",
"@resource.telemetry.sdk.version": "1.61.0",
"@instrumentation.@name": "org.apache.iceberg",
"@aws.account": "332745928618",
"@aws.region": "us-west-2",
"iceberg.schema.id": "1",
"iceberg.snapshot.id": "2001",
"iceberg.table.name": "aws_validation.test_table"
},
"value": [<timestamp>, "7"]
}
```
Result: `value = 7` — exactly the synthetic input. All resource attributes
(`service.name=iceberg-aws-validation-optionb`, `telemetry.sdk.language=java`,
etc.) plus the data-point attributes (`iceberg.snapshot.id=2001`,
`iceberg.schema.id=1`, `iceberg.table.name=aws_validation.test_table`)
round-trip cleanly.
```promql
last_over_time({__name__="iceberg.commit.records.added"}[1h])
```
→ `value = 12345`, with attributes:
- `@resource.service.name = iceberg-aws-validation-optionb`
- `iceberg.snapshot.id = 2002`
- `iceberg.operation = append`
- `iceberg.table.name = aws_validation.test_table`
```promql
last_over_time({__name__="iceberg.scan.file_size.bytes"}[1h])
```
→ `value = 4096000`, `__unit__ = By` (from the `setUnit("By")` on the OTel
counter builder; the metric is exported as a Sum, not a histogram).
```promql
last_over_time({__name__="iceberg.commit.attempts"}[1h])
```
→ `value = 1`.
Summary table:
| Metric | Expected | Observed in CloudWatch | Match |
|-----------------------------------|---------:|-----------------------:|-------|
| `iceberg.scan.result.data_files` | 7 | 7 | yes |
| `iceberg.commit.records.added` | 12345 | 12345 | yes |
| `iceberg.scan.file_size.bytes` | 4096000 | 4096000 | yes |
| `iceberg.commit.attempts` | 1 | 1 | yes |
The PromQL probe script returns `ALL_MATCH`.
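The comparison behind that verdict is simple enough to sketch. The Java below is an illustration only: the actual probe is the Python script named above, and `PromqlValueCheck` with its methods is hypothetical. It assumes observed values arrive as strings, as they do in the PromQL JSON response, and compares them numerically.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Compares expected synthetic values against values read back from PromQL. */
final class PromqlValueCheck {

  /** Expected values, copied from the summary table above. */
  static Map<String, String> expected() {
    Map<String, String> values = new LinkedHashMap<>();
    values.put("iceberg.scan.result.data_files", "7");
    values.put("iceberg.commit.records.added", "12345");
    values.put("iceberg.scan.file_size.bytes", "4096000");
    values.put("iceberg.commit.attempts", "1");
    return values;
  }

  /** Numeric comparison, so "7" and "7.0" are equal; a missing series counts as a mismatch. */
  static String verdict(Map<String, String> expected, Map<String, String> observed) {
    List<String> mismatches = new ArrayList<>();
    for (Map.Entry<String, String> e : expected.entrySet()) {
      String actual = observed.get(e.getKey());
      if (actual == null
          || new BigDecimal(actual).compareTo(new BigDecimal(e.getValue())) != 0) {
        mismatches.add(e.getKey());
      }
    }
    return mismatches.isEmpty() ? "ALL_MATCH" : "MISMATCH " + mismatches;
  }

  public static void main(String[] args) {
    // Stand-in for values parsed from the PromQL responses.
    Map<String, String> observed = new LinkedHashMap<>(expected());
    System.out.println(verdict(expected(), observed));
  }
}
```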
---
## 7. Differences vs the Option A validation
| Concern | Option A (2026-04-30) | Option B (2026-05-08, this run) |
|---|---|---|
| Who builds `OpenTelemetrySdk`? | The reporter, from catalog properties (`otel.metrics-reporter.*`). | The host application, before the catalog is loaded. |
| Catalog properties read by reporter | ~10 (`endpoint`, `protocol`, `headers`, `service-name`, `interval`, …). | **Zero.** |
| Reporter calls | `OpenTelemetrySdk.builder()...buildAndRegisterGlobal()` | `GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")` |
| `close()` semantics | Reporter shut down the SDK it built. | Reporter does not own anything → `close()` is a no-op. |
| Smoke test (env-gated `OtelEndpointSmokeTest`) | Present. | **Removed.** Host SDK construction is the host's responsibility. |
| Validator runner | Lived inside the iceberg test sourceset (uncommitted). | Lives entirely in `/tmp/iceberg-otel-validation/` (standalone Gradle). |
| Wire output (Collector + CloudWatch) | Identical to Option B (12 `iceberg.*` series, same attributes). | Identical to Option A (12 `iceberg.*` series, same attributes). |
| AWS-side artifacts | None created. | None created. |
Importantly, the **on-the-wire output is identical**: same metric names,
same descriptions/units, same attribute keys, same values. Reviewers can
verify this by comparing the data-point excerpts in section 5
of this report against section 6 of the Option A report.
---
## 8. Lessons learned / new gotchas
| Topic | Detail |
|---|---|
| `GlobalOpenTelemetry.set` is "set-once" | It throws if called twice. Tests that run `initialize()` may already have set the global; the validator calls `GlobalOpenTelemetry.resetForTest()` first to be safe. Production hosts wouldn't normally hit this, since they register exactly once at boot. |
| No catalog property knobs | A direct consequence of Option B is that there is nothing to misconfigure on the Iceberg side. If metrics aren't flowing, the diagnosis lies entirely on the host SDK side (env vars, agent attached, `GlobalOpenTelemetry.set` actually called, etc.). The docs should make this explicit so users don't go hunting for non-existent `otel.*` catalog keys. |
| `boto3` SigV4 query signing | `AWSRequest(url=..., params=...)` is required for `GET /api/v1/query?query=...`. Building the URL string yourself and then signing only the path silently produces a wrong signature → HTTP 400 with body `{"message":null}`. Use `req.prepare().url` so the canonical query string matches what's actually sent. |
| PromQL UTF-8 label-name selectors | CloudWatch's PromQL preview rejects both backtick-quoted (`` `@resource.service.name` ``) and double-quoted UTF-8 names inside the matcher braces. Stick to plain `{__name__="..."}` selectors and post-filter on `@resource.*` labels client-side, or rely on the `last_over_time(...[1h])` window for one-shot smoke tests. |
| `otlphttp` exporter is now `otlp_http` | Collector v0.151.0 prints a deprecation warning on the `otlphttp:` key. It still works today; rename it at the next config edit. |
| CloudWatch retains all prior runs | The 12 `iceberg.*` series from Option A are still visible alongside Option B. Always filter PromQL responses by `@resource.service.name` (or another distinguishing resource attribute) when validating a specific run. |
| `@instrumentation.@name` label is double-`@` | CloudWatch flattens the OTel scope name as `@instrumentation.@name=org.apache.iceberg` (note the leading `@` plus the field's own `@name` key). Cosmetic but surprising; it is useful as a "reporter fingerprint" filter (`@instrumentation.@name="org.apache.iceberg"`) when many SDKs share a workspace. |
The previously documented gotchas (region availability, no namespace
concept, ADOT vs upstream Collector, `otlphttp` endpoint format, label naming)
all still apply unchanged.
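
The `boto3` SigV4 gotcha above is easiest to see in code. The following is a minimal sketch rather than the validator's actual script: `signed_query_request` and `run_query` are hypothetical helper names, and the `endpoint` and `service` (SigV4 signing name) arguments are left as parameters because this report does not state them. It only relies on `botocore`'s `AWSRequest` and `SigV4Auth`, passing the query through `params` so that the signed canonical query string and the URL actually sent come from the same source.

```python
import json
import urllib.request


def signed_query_request(endpoint, promql, region, service, credentials):
    """Build a SigV4-signed GET for /api/v1/query.

    `endpoint` and `service` are assumptions supplied by the caller; they are
    not taken from this report.
    """
    # Imported lazily so the module still loads where botocore is absent.
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    request = AWSRequest(
        method="GET",
        url=endpoint.rstrip("/") + "/api/v1/query",
        # Pass the query via params; do not paste it into the URL by hand.
        params={"query": promql},
    )
    SigV4Auth(credentials, service, region).add_auth(request)

    # prepare() encodes params into the final URL, i.e. what gets sent.
    prepared = request.prepare()
    return urllib.request.Request(
        prepared.url, headers=dict(prepared.headers.items()), method="GET"
    )


def run_query(signed_request):
    """Send the signed request and decode the JSON body."""
    with urllib.request.urlopen(signed_request, timeout=30) as resp:
        return json.load(resp)
```

Credentials would typically come from `boto3.Session().get_credentials()`. The important design choice is that the URL is never assembled by string concatenation after signing; the request is sent to `prepared.url` exactly as `botocore` produced it.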
---
## 9. Status of resources
| Kind | Location | State |
|---|---|---|
| OTel Collector binary | `~/Documents/workspace/iceberg-otel-aws-validation/otelcol-contrib` | retained |
| OTel Collector config | `~/Documents/workspace/iceberg-otel-aws-validation/otel-config.yaml` | retained |
| Validator (Gradle project) | `/tmp/iceberg-otel-validation/` | retained for reference |
| CloudWatch metrics | `iceberg.*` series in account `332745928618`, region `us-west-2`, `service.name=iceberg-aws-validation-optionb` | retained until CloudWatch's default retention expires |
| AWS resources | none created | n/a |
| Iceberg repo | no commits, no uncommitted source files added under `core/src/` | clean |
---
## 10. Conclusion
The Option B refactor of `OtelMetricsReporter` is operationally
indistinguishable from Option A on the AWS side. All 12 `iceberg.*` series flow
into CloudWatch with the same names, units, descriptions, attribute keys, and
exact values as before. The migration to "host owns SDK" cost the user-visible
catalog properties (now zero) and the smoke test (now redundant); it gained a
much smaller surface area inside the Iceberg core. Reviewers can land Option B
with the same backend confidence Option A had.
</details>