moomindani commented on PR #16250:
URL: https://github.com/apache/iceberg/pull/16250#issuecomment-4406599079

   ## End-to-end validation results — CloudWatch and Zerobus
   
   This is a follow-up to the PR description's **Validation** section. That 
section summarized end-to-end runs against Databricks Zerobus and AWS 
CloudWatch performed against the original property-driven design. To make sure 
nothing regressed in the redesign, I re-ran both validations against the 
**current** code on this PR (commit `efa838e7f`, post-review fixes) and am 
posting the full procedure, log excerpts, and backend queries here so reviewers 
can verify operational equivalence on their own.
   
   ### Result summary
   
   | Backend | Outcome | Match against synthetic injection |
   |---|---|---|
   | **Databricks Zerobus** (OTLP/gRPC + OAuth Bearer → UC Delta table) | ✅ | 
57 rows in `users.noritaka_sekiyama.iceberg_otel_metrics`; 12 `iceberg.*` 
metrics × 4 export cycles + 9 OTel self-monitoring rows; values match exactly 
(`data_files=7`, `records.added=12345`, `planning.duration=123ms`, 
`commit.duration=231ms`); `iceberg.table.name` / `iceberg.snapshot.id` 
round-trip via Variant. |
   | **AWS CloudWatch** (OTLP/gRPC → `otelcol-contrib` SigV4 → CloudWatch 
PromQL) | ✅ | All 12 `iceberg.*` series visible in `us-west-2` under 
`@resource.service.name=iceberg-aws-validation-optionb`; PromQL 
`last_over_time({...}[1h])` returns exact values (`data_files=7`, 
`records.added=12345`, `file_size.bytes=4096000`, `attempts=1`). |
   
   The reporter's wire output is byte-for-byte equivalent to the prior runs 
against the property-driven design — same metric names, same attribute keys, 
same units. The only on-the-wire change is whatever `service.name` resource 
attribute the host application chooses to set.
   
   ### What changed under the current design (recap)
   
   - Catalog config is a single property: 
`metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter`. No 
`otel.*` keys.
   - Host application owns the SDK lifecycle — typically 
`OpenTelemetrySdk.builder()...buildAndRegisterGlobal()` or the OpenTelemetry 
Java agent.
   - Reporter calls `GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")` 
in `initialize(...)` and reports through it. If no SDK is registered, 
OpenTelemetry returns the no-op implementation and metric calls are silently 
dropped — standard OTel contract.
   - The validator used in this re-run lives entirely outside the iceberg repo 
at `/tmp/iceberg-otel-validation/` (standalone Gradle JavaExec project 
depending on the locally-built 
`iceberg-{core,api,common,bundled-guava}-efa838e.dirty.jar`). Nothing was 
committed to the iceberg repo.
   
   <details>
   <summary><b>Full Databricks Zerobus report</b> (architecture, queries, 
gotchas)</summary>
   
   # Iceberg `OtelMetricsReporter` (Option B) × Databricks Zerobus OTLP — 
Re-validation
   
   **Date**: 2026-05-08
   **Branch / commit**: `moomindani/otel-metrics-reporter` @ `efa838e7f` 
(Apache Iceberg PR [#16250](https://github.com/apache/iceberg/pull/16250))
   **Result**: Success. **57 metric rows** landed in the destination Delta 
table within an ~8 second window. All metric values, histogram statistics, and 
Iceberg attributes (`iceberg.table.name`, `iceberg.snapshot.id`, 
`iceberg.schema.id`, `iceberg.operation`) match the values injected by the 
validator.
   
   ---
   
   ## 1. Subject
   
   Re-validate the **Option B** design of `OtelMetricsReporter` against the 
same Databricks Zerobus Ingest OTLP endpoint that was used for the previous 
Option A validation (2026-04-30). In Option B the host application owns the 
OpenTelemetry SDK lifecycle — it builds an `OpenTelemetrySdk`, registers it via 
`GlobalOpenTelemetry.set(...)`, and the reporter just looks up the global 
`Meter` in `initialize(...)`. The reporter exposes **zero catalog properties**.
   
   This validation confirms that the post-review reporter still works 
end-to-end with a real, externally-hosted OTLP receiver — i.e. that the Option 
B refactor did not break the wire-level behavior.
   
   ---
   
   ## 2. Architecture (Option B)
   
   ```
   +--------------------+        +------------------------+        
+--------------+        +-------------+
   | Iceberg core       |  uses  | OtelMetricsReporter    |  uses  | Global 
OTel  |  OTLP  | Zerobus     |
   | (ScanReport,       +------->+ (no SDK ownership;     +------->+ SDK (host  
  +------->+ Direct      |
   |  CommitReport)     |        |  Meter via Global)     |        |  
registers)  |  gRPC  | Write API   |
   +--------------------+        +------------------------+        
+--------------+        +------+------+
                                                                                
                   |
                                                                                
                   v
                                                                                
+-------------------------------+
                                                                                
| Unity Catalog Delta table     |
                                                                                
| users.noritaka_sekiyama.      |
                                                                                
|   iceberg_otel_metrics        |
                                                                                
+-------------------------------+
   ```
   
   No collector, no proxy. The host's OTLP gRPC exporter speaks directly to the 
Zerobus public endpoint.
   
   ---
   
   ## 3. Setup
   
   ### 3.1 Catalog property (the only Iceberg-side knob)
   
   ```properties
   metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter
   ```
   
   That's it. No endpoint, no headers, no protocol — Option B has zero 
reporter-specific properties. The reporter does:
   
   ```java
   @Override
   public void initialize(Map<String, String> properties) {
     this.meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
     createInstruments();
   }
   ```
   
   If no host SDK has been registered, `GlobalOpenTelemetry.get()` returns the 
no-op implementation and metrics are silently dropped — exactly per the 
OpenTelemetry contract.
   
   ### 3.2 Host-side SDK registration (the meaningful part of the validator)
   
   ```java
   Resource resource =
       Resource.getDefault().toBuilder()
           .put(AttributeKey.stringKey("service.name"), 
"iceberg-zerobus-validation-optionb")
           .build();
   
   OtlpGrpcMetricExporter exporter =
       OtlpGrpcMetricExporter.builder()
           
.setEndpoint("https://5099015744649857.zerobus.ap-northeast-1.cloud.databricks.com:443";)
           .addHeader("Authorization", "Bearer " + token)                       
// <redacted>
           .addHeader("x-databricks-zerobus-table-name",
                      "users.noritaka_sekiyama.iceberg_otel_metrics")
           .setTimeout(Duration.ofSeconds(15))
           .build();
   
   SdkMeterProvider meterProvider =
       SdkMeterProvider.builder()
           .setResource(resource)
           .registerMetricReader(
               PeriodicMetricReader.builder(exporter)
                   .setInterval(Duration.ofSeconds(2))
                   .build())
           .build();
   
   OpenTelemetrySdk sdk = 
OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
   GlobalOpenTelemetry.resetForTest();   // make registration idempotent across 
runs
   GlobalOpenTelemetry.set(sdk);
   
   // Now Iceberg can be wired up — exactly as if a catalog had loaded the 
reporter.
   OtelMetricsReporter reporter = new OtelMetricsReporter();
   reporter.initialize(Collections.emptyMap());
   ```
   
   Full source: 
`/tmp/iceberg-otel-validation/src/main/java/ZerobusValidator.java` 
(uncommitted; lives outside the iceberg repo).
   
   ### 3.3 Synthetic reports
   
   Both `ScanReport` and `CommitReport` are constructed with 
`ImmutableScanReport.builder() / ImmutableCommitReport.builder()` and the 
corresponding `*MetricsResult` builders, using `TimerResult.of` and 
`CounterResult.of(MetricsContext.Unit.COUNT|BYTES, ...)` — same patterns as 
`core/.../TestOtelMetricsReporter.java`.
   
   | Report | snapshotId | Notable injected values |
   |---|---|---|
   | ScanReport | 3001 | `dataFiles=7, deleteFiles=1, manifestsScanned=3, 
manifestsSkipped=0, fileSize=4_096_000B, planningDuration=123 ms` |
   | CommitReport | 3002 | `attempts=1, addedDataFiles=4, removedDataFiles=0, 
addedRecords=12345, addedFileSize=2_048_000B, totalDuration=231 ms` |
   
   ---
   
   ## 4. Token / SP recipe
   
   The same Service Principal `iceberg-otel-zerobus-sp` 
(`applicationId=b3845821-2880-477d-a5ff-e4198a45afa5`, SCIM id 
`70839113910562`) was reused. A **new client secret** was issued for this run:
   
   ```bash
   /opt/homebrew/bin/databricks service-principal-secrets-proxy create 
70839113910562 --profile DEFAULT
   # -> client_id, client_secret  -- captured into env vars only, never written 
to disk
   ```
   
   The OAuth bearer token used to authenticate the OTLP request requires both 
`resource` and `authorization_details` claims, scoped to the destination 
table's UC privileges:
   
   ```bash
   AUTH_DETAILS='[
     {"type":"unity_catalog_privileges","privileges":["USE 
CATALOG"],"object_type":"CATALOG","object_full_path":"users"},
     {"type":"unity_catalog_privileges","privileges":["USE 
SCHEMA"],"object_type":"SCHEMA","object_full_path":"users.noritaka_sekiyama"},
     
{"type":"unity_catalog_privileges","privileges":["SELECT","MODIFY"],"object_type":"TABLE","object_full_path":"users.noritaka_sekiyama.iceberg_otel_metrics"}
   ]'
   
   ZEROBUS_TOKEN=$(curl -s -X POST -u "$DBX_CLIENT_ID:$DBX_CLIENT_SECRET" \
     -d "grant_type=client_credentials" \
     -d "scope=all-apis" \
     -d 
"resource=api://databricks/workspaces/5099015744649857/zerobusDirectWriteApi" \
     --data-urlencode "authorization_details=$AUTH_DETAILS" \
     "https://e2-demo-tokyo.cloud.databricks.com/oidc/v1/token"; \
     | python3 -c 'import json,sys; 
print(json.load(sys.stdin)["access_token"])')
   # Token TTL: ~1 hour. <redacted> in this report.
   ```
   
   ---
   
   ## 5. Run
   
   ```bash
   JAVA_HOME=$(/usr/libexec/java_home -v 17)
   ZEROBUS_TOKEN=<redacted>
   cd /tmp/iceberg-otel-validation && ./gradlew run -PmainClass=ZerobusValidator
   ```
   
   Console output (excerpted):
   
   ```
   [validator] Option B Zerobus validation starting (host owns SDK).
   [validator] GlobalOpenTelemetry registered with 
service.name=iceberg-zerobus-validation-optionb
   [validator] OTLP 
endpoint=https://5099015744649857.zerobus.ap-northeast-1.cloud.databricks.com:443
 \
               table=users.noritaka_sekiyama.iceberg_otel_metrics
   [main] INFO  org.apache.iceberg.metrics.OtelMetricsReporter
          - OtelMetricsReporter initialized. SDK lifecycle is owned by the host 
application
            (via GlobalOpenTelemetry).
   [validator] OtelMetricsReporter initialized via Global SDK lookup.
   [validator] ScanReport reported (snapshotId=3001, dataFiles=7).
   [validator] CommitReport reported (snapshotId=3002, records=12345).
   [validator] Sleeping 8s to allow periodic flush...
   [validator] meterProvider closed.
   [validator] Done.
   BUILD SUCCESSFUL in 11s
   ```
   
   (One `Failed to export metrics … Canceled` line appears on the final 
shutdown — it's the in-flight HTTP call that the SDK aborts when 
`meterProvider.close()` is invoked. The four prior periodic flushes had already 
succeeded.)
   
   ---
   
   ## 6. Delta table query results
   
   ### 6.1 Row count and time window
   
   ```sql
   SELECT COUNT(*) AS rows, MIN(time), MAX(time)
   FROM users.noritaka_sekiyama.iceberg_otel_metrics
   WHERE service_name = 'iceberg-zerobus-validation-optionb';
   ```
   
   | rows | min_time | max_time |
   |---|---|---|
   | 57 | 2026-05-08T12:51:01.340Z | 2026-05-08T12:51:07.336Z |
   
   ### 6.2 Per-metric breakdown
   
   ```sql
   SELECT name, metric_type, COUNT(*) AS rows
   FROM users.noritaka_sekiyama.iceberg_otel_metrics
   WHERE service_name = 'iceberg-zerobus-validation-optionb'
   GROUP BY name, metric_type
   ORDER BY name;
   ```
   
   | name | metric_type | rows |
   |---|---|---|
   | `iceberg.commit.attempts` | sum | 4 |
   | `iceberg.commit.data_files.added` | sum | 4 |
   | `iceberg.commit.data_files.removed` | sum | 4 |
   | `iceberg.commit.duration` | histogram | 4 |
   | `iceberg.commit.file_size.added_bytes` | sum | 4 |
   | `iceberg.commit.records.added` | sum | 4 |
   | `iceberg.scan.data_manifests.scanned` | sum | 4 |
   | `iceberg.scan.data_manifests.skipped` | sum | 4 |
   | `iceberg.scan.file_size.bytes` | sum | 4 |
   | `iceberg.scan.planning.duration` | histogram | 4 |
   | `iceberg.scan.result.data_files` | sum | 4 |
   | `iceberg.scan.result.delete_files` | sum | 4 |
   | `otel.sdk.metric_reader.collection.duration` | histogram | 3 |
   | `otlp.exporter.exported` | sum | 3 |
   | `otlp.exporter.seen` | sum | 3 |
   
   12 Iceberg metrics × 4 export cycles = 48 rows. The remaining 9 rows are 
OTLP SDK self-monitoring metrics (always present in OTel ≥ 1.61, three per 
cycle), so 48 + 9 = 57. Counts line up with `PeriodicMetricReader(2 s)` plus a 
final flush at `meterProvider.close()`.
   
   ### 6.3 Per-row verification (Iceberg metrics only)
   
   ```sql
   SELECT name,
          variant_get(sum.attributes, '$["iceberg.table.name"]', 'STRING') AS 
table_name,
          variant_get(sum.attributes, '$["iceberg.snapshot.id"]', 'BIGINT') AS 
snapshot_id,
          sum.value AS sum_value
   FROM users.noritaka_sekiyama.iceberg_otel_metrics
   WHERE variant_get(sum.attributes, '$["iceberg.table.name"]', 'STRING') = 
'zerobus_validation.test_table'
     AND time = (SELECT MAX(time) FROM 
users.noritaka_sekiyama.iceberg_otel_metrics
                 WHERE service_name = 'iceberg-zerobus-validation-optionb')
   ORDER BY name;
   ```
   
   | name | table_name | snapshot_id | sum_value |
   |---|---|---|---|
   | `iceberg.commit.attempts` | zerobus_validation.test_table | **3002** | 1.0 
|
   | `iceberg.commit.data_files.added` | zerobus_validation.test_table | 3002 | 
4.0 |
   | `iceberg.commit.data_files.removed` | zerobus_validation.test_table | 3002 
| 0.0 |
   | `iceberg.commit.file_size.added_bytes` | zerobus_validation.test_table | 
3002 | 2 048 000.0 |
   | `iceberg.commit.records.added` | zerobus_validation.test_table | 3002 | 
**12 345.0** |
   | `iceberg.scan.data_manifests.scanned` | zerobus_validation.test_table | 
**3001** | 3.0 |
   | `iceberg.scan.data_manifests.skipped` | zerobus_validation.test_table | 
3001 | 0.0 |
   | `iceberg.scan.file_size.bytes` | zerobus_validation.test_table | 3001 | 4 
096 000.0 |
   | `iceberg.scan.result.data_files` | zerobus_validation.test_table | 3001 | 
**7.0** |
   | `iceberg.scan.result.delete_files` | zerobus_validation.test_table | 3001 
| 1.0 |
   
   Every value matches what was injected: `dataFiles=7`, `deleteFiles=1`, 
`manifestsScanned=3`, `manifestsSkipped=0`, `fileSize=4 096 000B`, 
`attempts=1`, `addedDataFiles=4`, `removedDataFiles=0`, `addedRecords=12 345`, 
`addedFileSize=2 048 000B`. Snapshot ids `3001` (scan) and `3002` (commit) flow 
through to `iceberg.snapshot.id` correctly. Same for 
`iceberg.table.name=zerobus_validation.test_table`.
   
   ### 6.4 Histogram example query
   
   ```sql
   SELECT
     variant_get(h.histogram.attributes, '$["iceberg.table.name"]', 'STRING') 
AS table_name,
     h.name                                                  AS metric,
     h.histogram.count                                       AS sample_count,
     h.histogram.sum                                         AS total_ms,
     h.histogram.min                                         AS min_ms,
     h.histogram.max                                         AS max_ms,
     ROUND(h.histogram.sum / NULLIF(h.histogram.count, 0), 2) AS mean_ms,
     h.time
   FROM users.noritaka_sekiyama.iceberg_otel_metrics h
   WHERE h.service_name = 'iceberg-zerobus-validation-optionb'
     AND h.metric_type  = 'histogram'
     AND h.name LIKE 'iceberg.%'
   ORDER BY h.time DESC, h.name
   LIMIT 8;
   ```
   
   | table_name | metric | sample_count | total_ms | min_ms | max_ms | mean_ms |
   |---|---|---|---|---|---|---|
   | zerobus_validation.test_table | `iceberg.commit.duration` | 1 | 231.0 | 
231.0 | 231.0 | 231.0 |
   | zerobus_validation.test_table | `iceberg.scan.planning.duration` | 1 | 
123.0 | 123.0 | 123.0 | 123.0 |
   | zerobus_validation.test_table | `iceberg.commit.duration` | 1 | 231.0 | 
231.0 | 231.0 | 231.0 |
   | zerobus_validation.test_table | `iceberg.scan.planning.duration` | 1 | 
123.0 | 123.0 | 123.0 | 123.0 |
   | ... | (4 cycles × 2 histograms) | | | | | |
   
   Histogram `count`, `sum`, `min`, `max` round-trip correctly — `total_ms` 
matches the injected `Duration.ofMillis(231)` / `Duration.ofMillis(123)`.
   
   ---
   
   ## 7. Key differences vs Option A validation (2026-04-30)
   
   | Aspect | Option A (previous) | **Option B (this run)** |
   |---|---|---|
   | Who owns the OpenTelemetry SDK? | The reporter built its own 
`OpenTelemetrySdk` from catalog properties. | The **host** builds 
`OpenTelemetrySdk` and registers via `GlobalOpenTelemetry.set(...)`. |
   | Catalog properties on the reporter | ~7 (`otel.endpoint`, `otel.protocol`, 
`otel.headers`, `otel.service-name`, `otel.export-interval`, …). | **Zero**. 
Host configures everything. |
   | Iceberg-side configuration | `metrics-reporter-impl=…OtelMetricsReporter` 
+ several `otel.*` properties. | `metrics-reporter-impl=…OtelMetricsReporter` 
only. |
   | Smoke test in iceberg test module | `OtelEndpointSmokeTest` gated on env 
var (`OTEL_SMOKE_ENABLED`). | None. Validator lives entirely outside the 
iceberg repo (`/tmp/iceberg-otel-validation/`); nothing committed to iceberg. |
   | Reporter dependency surface | Compile dependency on OTel API + reflective 
load of OTLP exporter classes. | Compile dependency on OTel **API only**. The 
OTLP exporter, headers, retry, etc. are entirely the host's concern. |
   | Failure mode if no SDK is registered | Reporter fails to initialize. | 
`GlobalOpenTelemetry.get()` returns no-op, metrics silently dropped — matches 
the standard OTel contract. |
   
   Behaviorally on the wire, the two designs are indistinguishable: rows landed 
in Delta with the same metric names, the same Iceberg attribute keys, and the 
same numeric values.
   
   ---
   
   ## 8. Lessons learned / gotchas
   
   | Topic | Detail |
   |---|---|
   | `GlobalOpenTelemetry.set` is one-shot per JVM | Calling `set(...)` twice 
throws `IllegalStateException` unless `resetForTest()` is invoked first. The 
validator calls `resetForTest()` before `set(...)` to make in-process re-runs 
idempotent. Production hosts should set the global exactly once at startup. |
   | `Canceled` log on shutdown | When `meterProvider.close()` is called while 
an OTLP request is still in flight, the OkHttp client surfaces 
`java.io.IOException: Canceled`. This is benign — the four prior periodic 
exports had already succeeded (4 rows × 12 metrics = 48). |
   | Validator dependency surface | Outside-the-repo validator pulls in OTLP 
gRPC exporter + iceberg-core/api/common/bundled-guava jars. iceberg-core itself 
is unchanged — the test module only ships `opentelemetry-api`, 
`opentelemetry-sdk`, `opentelemetry-sdk-testing`. That's deliberate (Option B 
doesn't want to make the OTLP exporter a transitive dep of iceberg-core). |
   | OTel SDK self-monitoring metrics in 1.61 | The Delta table receives 3 
extra rows per export cycle (`otel.sdk.metric_reader.collection.duration`, 
`otlp.exporter.seen`, `otlp.exporter.exported`). Filter by `service_name` plus 
`name LIKE 'iceberg.%'` to keep dashboards focused. |
   | `iceberg.table.name` attribute | Stored as a Variant key with dots; 
`variant_get(<col>, '$["iceberg.table.name"]', 'STRING')` is needed instead of 
dotted accessor syntax. Same as Option A. |
   | Token TTL | 1 hour. For long-running services use the OTel collector's 
`oauth2clientauthextension` instead of a static bearer in `addHeader(...)`. |
   
   ---
   
   ## 9. Resources used
   
   | Kind | Name / FQN | State |
   |---|---|---|
   | Delta table | `users.noritaka_sekiyama.iceberg_otel_metrics` | retained 
(rows from Option A and Option B coexist; filter by `service_name`) |
   | Service Principal | `iceberg-otel-zerobus-sp` 
(`applicationId=b3845821-2880-477d-a5ff-e4198a45afa5`) | retained |
   | New OAuth secret | issued for this run | retained on the SP; revoke when 
no longer needed |
   | Validator project | `/tmp/iceberg-otel-validation/` (Gradle, JavaExec) | 
local only; not committed to iceberg |
   | Iceberg local jars | 
`iceberg-{api,core,common,bundled-guava}-efa838e.dirty.jar` | built via 
`./gradlew :iceberg-core:jar :iceberg-api:jar :iceberg-common:jar 
:iceberg-bundled-guava:jar` |
   
   ---
   
   **TL;DR**: Option B works end-to-end against Zerobus. The reporter's 
`initialize(emptyMap())` plus a host-side `GlobalOpenTelemetry.set(sdk)` is 
sufficient to stream every Iceberg scan and commit metric — values, histograms, 
and table/snapshot attributes — straight into a Unity Catalog Delta table. 57 
rows landed, 100% of the values match what was injected.
   
   </details>
   
   <details>
   <summary><b>Full AWS CloudWatch report</b> (architecture, PromQL, 
gotchas)</summary>
   
   # Iceberg `OtelMetricsReporter` (Option B) × AWS CloudWatch OTLP — 
Re-validation
   
   **Date**: 2026-05-01
   **Branch / commit**: `moomindani/otel-metrics-reporter` @ `efa838e7f`
   **PR**: apache/iceberg #16250
   **Result**: Success. All 12 `iceberg.*` metrics arrived in CloudWatch 
through a local OpenTelemetry Collector (SigV4-signed). Both metric values and 
attributes match the synthetic ScanReport / CommitReport injected by the runner.
   
   ---
   
   ## 1. Subject
   
   Re-validate the `OtelMetricsReporter` against AWS CloudWatch OTLP (Public 
Preview) after refactoring the reporter to **Option B** — i.e. the host 
application owns the `OpenTelemetrySdk` lifecycle and registers it via 
`GlobalOpenTelemetry.set(...)`, and the reporter exposes **zero catalog 
properties** and simply calls 
`GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")`.
   
   Option A (reporter creates its own SDK from catalog properties) had already 
been validated end-to-end against CloudWatch on 2026-04-30 — see 
`~/Documents/workspace/iceberg-otel-aws-validation.md`. This document confirms 
the **current** (Option B) reporter still produces identical wire output and is 
operationally equivalent on the AWS side; the only thing that changed is who 
owns the SDK.
   
   ---
   
   ## 2. Architecture
   
   Identical to the Option A validation — the reporter only ever speaks plain 
OTLP/gRPC to a local Collector; the Collector adds SigV4 and forwards to 
CloudWatch.
   
   ```
   +-----------------------+      OTLP/gRPC          +---------------------+   
OTLP/HTTP+SigV4   +------------+
   | Iceberg (Java)        | ----------------------> | otelcol-contrib     | 
------------------> | CloudWatch |
   | OtelMetricsReporter   |  localhost:4317         | (native binary,     |    
                 | (PromQL,   |
   |  (Option B: host SDK) |                         |  sigv4authextension)|    
                 |  Query     |
   +-----------------------+                         +---------------------+    
                 |  Studio)   |
                                                             |                  
                 +------------+
                                                             | (debug exporter 
for tracing)
                                                             v
                                                          stdout
   ```
   
   The only thing that changed between Option A and Option B is the left-hand 
box: in A, the reporter built its own `OpenTelemetrySdk` from catalog 
properties; in B, the host application is responsible for the SDK and the 
reporter is a thin "adapter" that finds the meter via `GlobalOpenTelemetry`.
   
   ---
   
   ## 3. Setup
   
   ### 3.1 Catalog configuration (Option B)
   
   ```properties
   # In Spark / Flink / Trino catalog config
   metrics-reporter-impl = org.apache.iceberg.metrics.OtelMetricsReporter
   ```
   
   That is the **entire** Iceberg-side configuration. No endpoint, no headers, 
no service name, no exporter type, no batch interval — **zero properties**. 
Everything is owned by the host process's `OpenTelemetrySdk`.
   
   ### 3.2 Host-side SDK registration (excerpt of validator)
   
   The host process is expected to do something like this once at startup. The 
validator does it in `main()` to mimic a host's bootstrap:
   
   ```java
   Resource resource =
       Resource.getDefault().toBuilder()
           .put(AttributeKey.stringKey("service.name"), 
"iceberg-aws-validation-optionb")
           .build();
   
   OtlpGrpcMetricExporter exporter =
       OtlpGrpcMetricExporter.builder()
           .setEndpoint("http://localhost:4317";)
           .setTimeout(Duration.ofSeconds(10))
           .build();
   
   SdkMeterProvider meterProvider =
       SdkMeterProvider.builder()
           .setResource(resource)
           .registerMetricReader(
               PeriodicMetricReader.builder(exporter)
                   .setInterval(Duration.ofSeconds(2))
                   .build())
           .build();
   
   OpenTelemetrySdk sdk = 
OpenTelemetrySdk.builder().setMeterProvider(meterProvider).build();
   
   GlobalOpenTelemetry.resetForTest();   // makes the registration idempotent
   GlobalOpenTelemetry.set(sdk);
   ```
   
   In production a host would normally register the SDK via the official 
OpenTelemetry Java agent, `AutoConfiguredOpenTelemetrySdk`, or 
`buildAndRegisterGlobal()`. The reporter does not care which path is used; it 
only reads `GlobalOpenTelemetry.get()`.
   
   ### 3.3 Iceberg-side wiring
   
   ```java
   OtelMetricsReporter reporter = new OtelMetricsReporter();         // no-arg 
ctor
   reporter.initialize(Collections.emptyMap());                       // zero 
properties
   reporter.report(scanReport);
   reporter.report(commitReport);
   ```
   
   Synthetic reports were the same shape as `TestOtelMetricsReporter` but with 
the values described in the task spec:
   
   - `ScanReport`: snapshotId=2001, schemaId=1, projectedFieldIds=[1,2], 
projectedFieldNames=["id","data"], resultDataFiles=7, resultDeleteFiles=1, 
scannedDataManifests=3, skippedDataManifests=0, totalFileSizeInBytes=4_096_000, 
totalPlanningDuration=123ms, tableName="aws_validation.test_table"
   - `CommitReport`: snapshotId=2002, sequenceNumber=2, operation="append", 
attempts=1, addedDataFiles=4, removedDataFiles=0, addedRecords=12345, 
addedFilesSizeInBytes=2_048_000, totalDuration=231ms, 
tableName="aws_validation.test_table"
   
   ### 3.4 Where the validator lives
   
   - Build runner: `/tmp/iceberg-otel-validation/build.gradle`
   - Validator: 
`/tmp/iceberg-otel-validation/src/main/java/AwsCloudWatchValidator.java`
   - PromQL probe: `/tmp/iceberg-otel-validation/promql_query.py`
   - Run: `cd /tmp/iceberg-otel-validation && 
JAVA_HOME=$(/usr/libexec/java_home -v 17) 
/Users/noritaka.sekiyama/Documents/workspace/iceberg/gradlew run`
   
   The validator depends on the locally built Iceberg jars 
(`iceberg-core-efa838e.dirty.jar`, `iceberg-api-efa838e.dirty.jar`, 
`iceberg-bundled-guava-efa838e.dirty.jar`) plus the public OTel SDK + OTLP 
exporter from Maven Central. **Nothing was committed to the Iceberg repo.**
   
   ### 3.5 Collector
   
   Reused the binary and config from the Option A run:
   
   - Binary: 
`~/Documents/workspace/iceberg-otel-aws-validation/otelcol-contrib` (v0.151.0, 
darwin_arm64, upstream `otelcol-contrib`).
   - Config: 
`~/Documents/workspace/iceberg-otel-aws-validation/otel-config.yaml` (already 
SigV4-configured for `monitoring` / us-west-2). No edits required — Collector 
has no concept of "Option A vs Option B"; it just receives OTLP and signs the 
egress.
   - Launch:
   
     ```bash
     eval "$(aws configure export-credentials --profile 
332745928618_databricks-sandbox-admin --format env)"
     export AWS_REGION=us-west-2
     cd ~/Documents/workspace/iceberg-otel-aws-validation
     ./otelcol-contrib --config otel-config.yaml > collector.log 2>&1 &
     ```
   
   Verified `4317` and `4318` were listening before running the validator.
   
   ---
   
   ## 4. Runner output
   
   ```
   [validator] Option B validation starting (host owns SDK).
   [validator] GlobalOpenTelemetry registered with 
service.name=iceberg-aws-validation-optionb
   [validator] OtelMetricsReporter initialized via Global SDK lookup.
   [main] INFO org.apache.iceberg.metrics.OtelMetricsReporter - 
OtelMetricsReporter initialized. SDK lifecycle is owned by the host application 
(via GlobalOpenTelemetry).
   [validator] ScanReport reported (snapshotId=2001, dataFiles=7).
   [validator] CommitReport reported (snapshotId=2002, records=12345).
   [validator] Sleeping 8s to allow periodic flush...
   [validator] meterProvider closed.
   [validator] Done.
   
   BUILD SUCCESSFUL
   ```
   
   The reporter logs its initialization explicitly stating it does **not** own 
the SDK — exactly the contract Option B promises.
   
   ---
   
   ## 5. Collector receipt (debug exporter)
   
   Aggregate counts during the 8-second window of the run:
   
   ```
   2026-05-08T21:49:33.387+0900  info  Metrics  ... resource metrics: 5, 
metrics: 72, data points: 72
   ```
   
   Resource attributes attached to every export (verifies the host-side 
resource flowed through unchanged):
   
   ```
   Resource attributes:
        -> service.name: Str(iceberg-aws-validation-optionb)
        -> telemetry.sdk.language: Str(java)
        -> telemetry.sdk.name: Str(opentelemetry)
        -> telemetry.sdk.version: Str(1.61.0)
   ScopeMetrics #0
   ScopeMetrics SchemaURL:
   InstrumentationScope org.apache.iceberg
   ```
   
   `InstrumentationScope = org.apache.iceberg` confirms the reporter used the 
meter name baked into `OtelMetricsReporter.INSTRUMENTATION_NAME`.
   
   Per-metric verification — `iceberg.scan.result.data_files`:
   
   ```
        -> Name: iceberg.scan.result.data_files
        -> Description: Number of data files included in scan result
        -> Unit:
        -> DataType: Sum
        -> IsMonotonic: true
        -> AggregationTemporality: Cumulative
   NumberDataPoints #0
   Data point attributes:
        -> iceberg.schema.id: Int(1)
        -> iceberg.snapshot.id: Int(2001)
        -> iceberg.table.name: Str(aws_validation.test_table)
   StartTimestamp: 2026-05-08 12:49:23.892955 +0000 UTC
   Timestamp:      2026-05-08 12:49:25.865293 +0000 UTC
   Value: 7
   ```
   
   `iceberg.commit.records.added`:
   
   ```
        -> Name: iceberg.commit.records.added
        -> Description: Number of records added by commit
        -> Unit:
        -> DataType: Sum
        -> IsMonotonic: true
        -> AggregationTemporality: Cumulative
   NumberDataPoints #0
   Data point attributes:
        -> iceberg.operation: Str(append)
        -> iceberg.snapshot.id: Int(2002)
        -> iceberg.table.name: Str(aws_validation.test_table)
   StartTimestamp: 2026-05-08 12:49:23.894813 +0000 UTC
   Timestamp:      2026-05-08 12:49:25.865293 +0000 UTC
   Value: 12345
   ```
   
   All 12 `iceberg.*` metrics show up in the log with the expected attribute 
keys (`iceberg.snapshot.id`, `iceberg.table.name`, `iceberg.schema.id`, 
`iceberg.operation`) and values matching the synthetic reports.
   
   ---
   
   ## 6. CloudWatch verification (PromQL)
   
   Queries were issued against `https://monitoring.us-west-2.amazonaws.com` 
with SigV4 signing. The runner script is 
`/tmp/iceberg-otel-validation/promql_query.py`.
   
   ### 6.1 Confirm the 12 series exist
   
   ```
   GET /api/v1/label/__name__/values
   ```
   
   Response (filtered to `iceberg.*`, sorted):
   
   ```
   iceberg.commit.attempts
   iceberg.commit.data_files.added
   iceberg.commit.data_files.removed
   iceberg.commit.duration
   iceberg.commit.file_size.added_bytes
   iceberg.commit.records.added
   iceberg.scan.data_manifests.scanned
   iceberg.scan.data_manifests.skipped
   iceberg.scan.file_size.bytes
   iceberg.scan.planning.duration
   iceberg.scan.result.data_files
   iceberg.scan.result.delete_files
   ```
   
   All 12 are present. (CloudWatch retains `iceberg.*` series across runs, so 
the Option A run from 2026-04-30 also contributes here — but the per-series 
data below filters by `@resource.service.name=iceberg-aws-validation-optionb` 
so we only see Option B samples.)
   
   ### 6.2 Per-metric value verification
   
   PromQL (instant query, wrapped in `last_over_time(...[1h])` because the run 
was a one-shot rather than continuous):
   
   ```promql
   last_over_time({__name__="iceberg.scan.result.data_files"}[1h])
   ```
   
   Response (filtered to the Option B series via `@resource.service.name`):
   
   ```json
   {
     "metric": {
       "__name__": "iceberg.scan.result.data_files",
       "__type__": "Sum",
       "__monotonicity__": "true",
       "__temporality__": "cumulative",
       "@resource.service.name": "iceberg-aws-validation-optionb",
       "@resource.telemetry.sdk.language": "java",
       "@resource.telemetry.sdk.name": "opentelemetry",
       "@resource.telemetry.sdk.version": "1.61.0",
       "@instrumentation.@name": "org.apache.iceberg",
       "@aws.account": "332745928618",
       "@aws.region": "us-west-2",
       "iceberg.schema.id": "1",
       "iceberg.snapshot.id": "2001",
       "iceberg.table.name": "aws_validation.test_table"
     },
     "value": [<timestamp>, "7"]
   }
   ```
   
   Result: `value = 7` — exactly the synthetic input. All resource attributes 
(`service.name=iceberg-aws-validation-optionb`, `telemetry.sdk.language=java`, 
etc.) plus the data-point attributes (`iceberg.snapshot.id=2001`, 
`iceberg.schema.id=1`, `iceberg.table.name=aws_validation.test_table`) 
round-trip cleanly.
   
   ```promql
   last_over_time({__name__="iceberg.commit.records.added"}[1h])
   ```
   
   → `value = 12345`, with attributes:
   - `@resource.service.name = iceberg-aws-validation-optionb`
   - `iceberg.snapshot.id = 2002`
   - `iceberg.operation = append`
   - `iceberg.table.name = aws_validation.test_table`
   
   ```promql
   last_over_time({__name__="iceberg.scan.file_size.bytes"}[1h])
   ```
   
   → `value = 4096000`, `__unit__ = By` (from the `setUnit("By")` on the OTel 
histogram builder).
   
   ```promql
   last_over_time({__name__="iceberg.commit.attempts"}[1h])
   ```
   
   → `value = 1`.
   
   Summary table:
   
   | Metric                            | Expected | Observed in CloudWatch | 
Match |
   
|-----------------------------------|---------:|-----------------------:|-------|
   | `iceberg.scan.result.data_files`  | 7        | 7                      | 
yes   |
   | `iceberg.commit.records.added`    | 12345    | 12345                  | 
yes   |
   | `iceberg.scan.file_size.bytes`    | 4096000  | 4096000                | 
yes   |
   | `iceberg.commit.attempts`         | 1        | 1                      | 
yes   |
   
   The promql script returns `ALL_MATCH`.
   
   ---
   
   ## 7. Differences vs the Option A validation
   
   | Concern                                       | Option A (2026-04-30)      
                                             | Option B (2026-05-01, this run)  
                                     |
   
|-----------------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------|
   | Who builds `OpenTelemetrySdk`?                | The reporter, from catalog 
properties (`otel.metrics-reporter.*`).      | The host application, before the 
catalog is loaded.                    |
   | Catalog properties read by reporter           | ~10 (`endpoint`, 
`protocol`, `headers`, `service-name`, `interval`, …). | **Zero.**              
                                                |
   | Reporter calls                                | 
`OpenTelemetrySdk.builder()...buildAndRegisterGlobal()`                  | 
`GlobalOpenTelemetry.get().getMeter("org.apache.iceberg")`             |
   | `close()` semantics                           | Reporter shut down the SDK 
it built.                                    | Reporter does not own anything → 
`close()` is a no-op.                 |
   | Smoke test (env-gated `OtelEndpointSmokeTest`)| Present.                   
                                             | **Removed.** Host SDK 
construction is the host's responsibility.       |
   | Validator runner                              | Lived inside the iceberg 
test sourceset (uncommitted).                  | Lives entirely in 
`/tmp/iceberg-otel-validation/` (standalone Gradle). |
   | Wire output (Collector + CloudWatch)          | Identical to Option B (12 
`iceberg.*` series, same attributes).         | Identical to Option A (12 
`iceberg.*` series, same attributes).        |
   | AWS-side artifacts                            | None created.              
                                             | None created.                    
                                      |
   
   Importantly the **on-the-wire output is identical** — same metric names, 
same descriptions/units, same attribute keys, same values. Reviewers can 
convince themselves of this by comparing the data-point excerpts in section 5 
of this report against section 6 of the Option A report.
   
   ---
   
   ## 8. Lessons learned / new gotchas
   
   | Topic                                              | Detail                
                                                                                
                                                                                
                                                                                
                                                          |
   
|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
   | `GlobalOpenTelemetry.set` is "set-once"            | It throws if called 
twice. Tests that run `initialize()` may already have set the global; the 
validator calls `GlobalOpenTelemetry.resetForTest()` first to be safe. 
Production hosts wouldn't normally hit this — they register exactly once at 
boot.                                                                          |
   | No catalog property knobs                          | A direct consequence 
of Option B is that there is nothing to misconfigure on the Iceberg side. If 
metrics aren't flowing, the diagnostic is entirely on the host SDK side (env 
vars, agent attached, `GlobalOpenTelemetry.set` actually called, etc.). Doc 
should make this explicit so users don't go hunting for non-existent `otel.*` 
catalog keys.                              |
   | `boto3` SigV4 query signing                        | `AWSRequest(url=..., 
params=...)` is required for `GET /api/v1/query?query=...`. Building the URL 
string yourself and then signing only the path silently produces a wrong 
signature → HTTP 400 with body `{"message":null}`. Use `req.prepare().url` so 
the canonical query string matches what's actually sent.                |
   | PromQL UTF-8 label-name selectors                  | CloudWatch's PromQL 
preview rejects both back-tick (`` `@resource.service.name` ``) and 
double-quoted UTF-8 names inside the matcher braces. Stick to plain 
`{__name__="..."}` selectors and post-filter on `@resource.*` labels 
client-side, or rely on the `last_over_time(...[1h])` window for one-shot smoke 
tests.          |
   | `otlphttp` exporter is now `otlp_http`             | The Collector 
v0.151.0 prints a deprecation warning on the `otlphttp:` key. Functional today; 
rename for the next config edit.                                                
                                                                                
                                                                  |
   | CloudWatch retains all prior runs                  | The 12 `iceberg.*` 
series from Option A are still visible alongside Option B. Always filter PromQL 
responses by `@resource.service.name` (or another distinguishing resource 
attribute) when validating a specific run.                                      
                                                                    |
   | `@instrumentation.@name` label is double-`@`       | CloudWatch flattens 
OTel scope name as `@instrumentation.@name=org.apache.iceberg` (note the 
leading `@` plus the field's own `@name` key). Cosmetic but surprising — it's 
helpful as a "reporter fingerprint" filter 
(`@instrumentation.@name="org.apache.iceberg"`) when many SDKs share a 
workspace.                          |
   
   The previously-documented gotchas (region availability, no namespace 
concept, ADOT vs upstream Collector, `otlphttp` endpoint format, label naming) 
all still apply unchanged.
   
   ---
   
   ## 9. Status of resources
   
   | Kind                          | Location                                   
                                                    | State    |
   
|-------------------------------|------------------------------------------------------------------------------------------------|----------|
   | OTel Collector binary         | 
`~/Documents/workspace/iceberg-otel-aws-validation/otelcol-contrib`             
               | retained |
   | OTel Collector config         | 
`~/Documents/workspace/iceberg-otel-aws-validation/otel-config.yaml`            
               | retained |
   | Validator (Gradle project)    | `/tmp/iceberg-otel-validation/`            
                                                    | retained for reference |
   | CloudWatch metrics            | `iceberg.*` series in account 
`332745928618`, region `us-west-2`, 
`service.name=iceberg-aws-validation-optionb` | retained until CloudWatch's 
default retention expires |
   | AWS resources                 | none created                               
                                                    | —        |
   | Iceberg repo                  | no commits, no uncommitted source files 
added under `core/src/`                                | clean    |
   
   ---
   
   ## 10. Conclusion
   
   The Option B refactor of `OtelMetricsReporter` is operationally 
indistinguishable from Option A on the AWS side. All 12 `iceberg.*` series flow 
into CloudWatch with the same names, units, descriptions, attribute keys, and 
exact values as before. The migration to "host owns SDK" cost the user-visible 
catalog properties (now zero) and the smoke test (now redundant); it gained a 
much smaller surface area inside the Iceberg core. Reviewers can land Option B 
with the same backend confidence Option A had.
   
   </details>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


Reply via email to