I think my email last week didn't go through, but to follow up on this I've opened a PR with the current spec draft: https://github.com/apache/iceberg/pull/16518
That PR does not add it to the table format or commit us to using Mumbling; it is just for coordination. We will want to have a vote to adopt the spec when it is time to make those decisions. In the meantime, having the spec in our repo unblocks the adaptive metadata work that Amogh is doing and lets us make spec changes as we build a reference implementation.

Building a reference implementation should tell us how complex or difficult it is to implement the spec across languages. My proof of concept was a small amount of code, but a fuller implementation will show whether there are issues.

Also, I wouldn't necessarily want to wait for this to be adopted by the Roaring community. For one thing, the trade-off of using descriptor bytes (documented in the "Design decisions" section) may not align with the model that community wants, although I'm fairly certain that it's the right choice for our use case. For another, the proposal may not get traction or could take a while. I think we should be comfortable moving forward on our own, although I think it's likely that this will make it upstream. Committing the draft and working on a reference implementation will give us more information about how difficult it would be to maintain if it is not added upstream.

Ryan

On Fri, May 15, 2026 at 4:08 PM Ryan Blue <[email protected]> wrote:

> It sounds like we're mostly aligned on this so I'd like to keep moving
> forward.
>
> I'll open a PR to commit the spec to the Iceberg repo so we can keep it
> versioned and build implementations. This is not committing to using the
> format, it is just for coordination. I expect that we will want to have a
> vote to adopt the spec when we want to make that decision. This should
> unblock implementations as well as the tentative spec changes for adaptive
> metadata that Amogh is working on.
>
> This should give us more information about how complex or difficult it is
> to implement the spec across languages. My proof of concept was pretty
> small, but we can see if there are issues.
>
> One thing I wouldn't necessarily want to do is wait for this to be adopted
> by the Roaring community. For one thing, the trade-off of using descriptor
> bytes (documented in the "Design decisions" section) may not align with the
> model that community wants, although I'm fairly certain that it's the right
> choice for our use case. Another is that it may just not get traction or
> could take a while. I think we should be comfortable moving forward on our
> own (or not). Still, we can get a draft spec committed and go from there to
> get more information on how much work it is to build and maintain.
>
> Ryan
>
> On Tue, Apr 21, 2026 at 7:07 AM Andrei Tserakhau via dev <[email protected]> wrote:
>
>> I would argue that other languages should not be a blocker here.
>>
>> I can speak on behalf of iceberg-go: implementing this as a native
>> feature there is doable.
>>
>> Implementing Mumbling natively in Go is, in the worst case, ~1–2 weeks of
>> isolated work in a new internal package; in the best case (Roaring
>> upstream accepts it), it's days of glue code. I assume the cost is
>> similar for other languages (primarily Java and C++).
>>
>> There is no language-specific risk here: the format is deliberately
>> simple, the Rust prototype is small, the spec has concrete byte-level test
>> vectors, and it does not touch any other iceberg-go packages.
>>
>> Best,
>> Andrei
>>
>> Tue, Apr 21, 2026, 15:30 Maximilian Michels <[email protected]>:
>>
>>> Hi Ryan,
>>>
>>> Thanks for the detailed analysis. The storage savings (for the sparse
>>> range) and the general memory savings over Roaring are compelling.
>>>
>>> My main concern would be having to maintain our own bitmap format
>>> across all implementations of Iceberg. I suppose it would be mainly
>>> Java and Rust, as we can leverage Rust bindings for other languages,
>>> but Roaring already has implementations for every language Iceberg
>>> supports today.
>>>
>>> If we can include Mumbling as part of Roaring, this becomes a no-brainer.
>>>
>>> -Max
>>>
>>> On Tue, Apr 21, 2026 at 1:02 AM Ryan Blue <[email protected]> wrote:
>>> >
>>> > Hi everyone,
>>> >
>>> > For the v4 adaptive metadata tree work, we are planning on embedding
>>> > bitmaps in the root manifest that act as metadata/manifest deletion
>>> > vectors (MDVs). Amogh looked into how much space this would take in the
>>> > manifests and we found that the Roaring format is pretty large at the
>>> > scale we're targeting. When we compare it to raw bitmaps, we would be
>>> > storing an extra 500-2,000 bytes per bitmap. As a result, I tried to
>>> > see if we could use the ideas from Roaring, but with smaller containers
>>> > to fit better with our more limited use case: manifests that contain
>>> > roughly 50,000 entries (a single Roaring container). Since it is like
>>> > Roaring but smaller, I've been calling the new format Mumbling.
>>> >
>>> > You can view the results comparing Roaring, raw bitmaps, and Mumbling.
>>> > The results look promising: compressed sizes track much more closely to
>>> > the raw bitmap and the format has smaller overhead in memory than even
>>> > Roaring because of the more granular containers.
>>> >
>>> > The next steps are to discuss whether we want to use this format. To do
>>> > that, I've written up a Mumbling spec document so that it is clear what
>>> > exactly the format is doing. That should help us evaluate the design
>>> > choices and the cost of implementing this.
>>> >
>>> > I think that we should move forward with this bitmap format. It would
>>> > save quite a bit of space in the root manifest and it is a fairly
>>> > simple spec. My size tests used an implementation in Rust that is
>>> > fairly compact so it is not a huge amount of work. I've also reached
>>> > out and we may be able to partner with the Roaring community to make
>>> > this a part of the larger standard.
>>> >
>>> > Please take a look and discuss. Thanks,
>>> >
>>> > Ryan
>>>
>>
