Hi all,

I wanted to share a proof-of-concept tool I've been building called Iceberg Doctor, and get the community's thoughts on it.

The motivation came from managing Iceberg tables at Confluent (via Tableflow). When users told us their table was expensive to query or behaving unexpectedly, there wasn't a clean way to investigate what was actually going on inside the metadata. Since we often did not own the buckets, we had to ask customers to run queries, investigate the results, ask them to run more queries, and so on. I also felt there was an opportunity to make Iceberg's internal structure more accessible to multiple parties without needing constant access to the actual storage.

Iceberg Doctor is a forensics and visualization toolchain patterned loosely after Java Flight Recorder. A collector reads Iceberg metadata (root metadata.json → manifest-list avros → manifest avros) and produces a single self-contained artifact (.icediag.json). A separate static web viewer renders that artifact through multiple lenses:

- A snapshot DAG (left-to-right, oldest → newest) with nodes colored by operation type and sized by record count
- A Findings panel that runs detection rules automatically and surfaces problems ranked by severity, with suggested remediations, for things like manifest bloat, delete accumulation, high read amplification, compaction staleness, and more
- A Charts overlay with time series across snapshot history (file composition, manifest counts, amplification p50/p95/max)
- Per-manifest drill-down with file counts, size distribution, and amplification rollup

The idea is that you can capture this static file at any point and run multiple visualizations or analyses on it, instead of traversing the metadata tree across multiple files.

GitHub: https://github.com/sarthaksin1857/IcebergDoctor (pictures and examples are there)

Current status: this is a POC. The collector and viewer are functional against real table metadata. Direct object storage integration isn't done yet; currently, you must pass a local metadata directory manually.
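For readers less familiar with the metadata layout, the first hop of the collector's traversal described above (root metadata.json to per-snapshot entries) can be sketched roughly as follows. This is only an illustration, not Iceberg Doctor's actual collector code: the function names `snapshot_graph` and `load_snapshot_graph` are hypothetical, the sketch relies only on snapshot fields defined in the Iceberg table spec (`snapshot-id`, `parent-snapshot-id`, `timestamp-ms`, `summary.operation`, `manifest-list`), and it stops before the Avro manifest lists, which would need an Avro reader.

```python
import json
from pathlib import Path


def snapshot_graph(metadata):
    """Build (nodes, edges) for a snapshot DAG from a parsed root metadata dict.

    Hypothetical helper for illustration; not part of Iceberg Doctor.
    """
    # Order snapshots oldest -> newest, matching the left-to-right DAG layout.
    snapshots = sorted(metadata.get("snapshots", []), key=lambda s: s["timestamp-ms"])

    nodes = {}
    edges = []
    for snap in snapshots:
        snap_id = snap["snapshot-id"]
        summary = snap.get("summary", {})  # optional in older format versions
        nodes[snap_id] = {
            "timestamp_ms": snap["timestamp-ms"],
            "operation": summary.get("operation", "unknown"),
            # Location of the manifest-list Avro file: the next hop of the traversal.
            "manifest_list": snap.get("manifest-list"),
        }
        parent = snap.get("parent-snapshot-id")
        if parent is not None:
            edges.append((parent, snap_id))
    return nodes, edges


def load_snapshot_graph(metadata_path):
    """Read a local root metadata.json file and return its snapshot graph."""
    metadata = json.loads(Path(metadata_path).read_text())
    return snapshot_graph(metadata)
```

Calling `load_snapshot_graph` on a local copy of a table's root metadata file returns a dict of snapshot nodes keyed by snapshot ID, plus a list of `(parent, child)` edges. Following each node's `manifest_list` path into the Avro files would be the next step.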
My next planned step is to support passing a remote path directly (S3, GCS, etc.) so the tool can collect remotely without a local copy.

I'd love to hear:

1. Whether this kind of tooling would be useful to others in the community
2. Any thoughts on the detection rules or findings. Are there patterns you'd want surfaced that aren't here yet?
3. Ideas for improvement, especially around the viewer or artifact format

Best,
Sarthak
