Hi all,

I wanted to share a proof-of-concept tool I've been building called Iceberg
Doctor, and get the community's thoughts on it.

The motivation stemmed from managing Iceberg tables at Confluent (via
Tableflow).

When users told us their table was expensive to query or behaving
unexpectedly, there wasn't a clean way to investigate what was actually
going on inside the metadata. Since we often did not own the buckets, we
had to ask customers to run queries, investigate the results, then ask them
to run more queries, and so on.

I also felt there was an opportunity to make Iceberg's internal structure
more accessible to multiple parties without needing constant access to the
actual storage.

Iceberg Doctor is a forensics and visualization toolchain patterned loosely
after Java Flight Recorder. A collector reads Iceberg metadata (root
metadata.json → manifest-list avros → manifest avros) and produces a single
self-contained artifact (.icediag.json). A separate static web viewer
renders that artifact through multiple lenses:

- A snapshot DAG (left-to-right, oldest → newest) with nodes colored by
operation type and sized by record count
- A Findings panel that runs detection rules automatically and surfaces
problems ranked by severity with suggested remediations, such as manifest
bloat, delete accumulation, high read amplification, and compaction
staleness, among others
- A Charts overlay with time-series across snapshot history (file
composition, manifest counts, amplification p50/p95/max)
- Per-manifest drill-down with file counts, size distribution, and
amplification rollup
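
To give a feel for what a detection rule might look like, here is a minimal
sketch of a delete-accumulation check. It is not the actual Iceberg Doctor
implementation: the Finding class, the function names, the thresholds, and
the remediation text are all illustrative assumptions. The only thing taken
from Iceberg itself is the pair of optional snapshot summary fields
total-data-files and total-delete-files, which are stored as strings.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical severity ordering used to rank findings (lower sorts first).
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Finding:
    rule: str
    severity: str
    detail: str
    remediation: str


def check_delete_accumulation(
    summary: dict, warn_ratio: float = 0.1, high_ratio: float = 0.5
) -> Optional[Finding]:
    """Flag a snapshot whose delete-file count is large relative to data files.

    `summary` is a snapshot's summary map; values are strings in Iceberg
    metadata, and both fields are optional, so they default to 0 here.
    The ratio thresholds are assumptions chosen for illustration.
    """
    data_files = int(summary.get("total-data-files", 0))
    delete_files = int(summary.get("total-delete-files", 0))
    if data_files == 0 or delete_files == 0:
        return None
    ratio = delete_files / data_files
    if ratio < warn_ratio:
        return None
    severity = "high" if ratio >= high_ratio else "medium"
    return Finding(
        rule="delete-accumulation",
        severity=severity,
        detail=f"{delete_files} delete files for {data_files} data files",
        remediation="Consider compacting so accumulated deletes are applied.",
    )


def rank(findings: List[Finding]) -> List[Finding]:
    """Order findings so the most severe come first."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
```

A rule of this shape only needs the snapshot summary, so it can run against
a captured artifact without touching the manifests.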

The idea is that you can capture this static file at any point and then run
multiple visualizations or analyses on it, instead of re-traversing the
metadata tree across multiple files each time.
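
As a rough illustration of the "capture once, analyze many times" idea, the
sketch below covers only the first hop of collection: it flattens the
snapshot list of a parsed root metadata.json into one JSON-serializable
dict. The output layout and the function names are hypothetical and are not
the real .icediag.json schema. The input keys (snapshots, snapshot-id,
parent-snapshot-id, timestamp-ms, manifest-list, summary) are the ones
defined by the Iceberg table spec. Reading the manifest-list and manifest
Avro files would need an Avro library and is left out here.

```python
import json
from pathlib import Path


def collect_snapshots(metadata: dict) -> dict:
    """Flatten snapshots from a parsed Iceberg metadata.json into one dict.

    The returned layout is an assumption made for this example only.
    """
    snapshots = []
    for snap in metadata.get("snapshots", []):
        summary = snap.get("summary", {})
        snapshots.append(
            {
                "snapshot_id": snap["snapshot-id"],
                # Absent for the first snapshot in a lineage.
                "parent_id": snap.get("parent-snapshot-id"),
                "timestamp_ms": snap["timestamp-ms"],
                "operation": summary.get("operation"),
                "manifest_list": snap.get("manifest-list"),
                "summary": summary,
            }
        )
    # Oldest first, matching a left-to-right snapshot view.
    snapshots.sort(key=lambda s: s["timestamp_ms"])
    return {
        "table_uuid": metadata.get("table-uuid"),
        "current_snapshot_id": metadata.get("current-snapshot-id"),
        "snapshots": snapshots,
    }


def collect_from_file(metadata_path: str) -> dict:
    """Read a local metadata.json and return the flattened artifact dict."""
    return collect_snapshots(json.loads(Path(metadata_path).read_text()))
```

The parent_id links are enough to rebuild the snapshot lineage later, so a
viewer working from the captured file does not need storage access.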

GitHub: https://github.com/sarthaksin1857/IcebergDoctor (Pictures and
examples are here)

Current status: this is a POC. The collector and viewer are functional
against real table metadata. Direct object storage integration isn't done
yet; currently, you must pass a local metadata directory manually. My next
planned step is to support passing a path directly (S3, GCS, etc.) so the
tool can collect remotely without a local copy.

I'd love to hear:

1. Whether this kind of tooling would be useful to others in the community
2. Any thoughts on the detection rules or findings? Are there patterns
you'd want surfaced that aren't here yet?
3. Ideas for improvement, especially around the viewer or artifact format

Best,
Sarthak
