Good point, the changes are contained to the theta sketch definition in this case. Nothing is changing in the Puffin metadata itself in this change. I missed that in the spec we already versioned the blob type for example: https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type.
Considering this, the update should probably be more granular at the sketch level rather than a Puffin file format upgrade. If I recall correctly, this was intentionally a design in Puffin so that new blob types could be added without necessarily needing an entire format upgrade. Thanks, Amogh Jahagirdar On 2024/07/05 17:18:56 Ryan Blue wrote: > For the compatibility and version bump question, shouldn't this be a new > revision of the theta sketch rather than a new version of Puffin? I think > we want to avoid making changes to the file format itself if we can contain > the changes to the sketch definitions. > > +1 for raising a vote once we have a PR with the full changes. > > On Fri, Jul 5, 2024 at 9:35 AM Amogh Jahagirdar <amo...@apache.org> wrote: > > > Hey all, > > > > Thanks everyone for providing your input on this discussion thread. There > > are 2 points I'd like to bring up, which in hindsight I should've > > recognized. > > > > If we look at the current Iceberg Improvement Process > > https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals > > and how proposals are adopted > > https://iceberg.apache.org/contribute/#how-are-proposals-adopted, since > > this counts as spec change it should go through a formal vote process as JB > > alluded to. > > > > Next, even though this spec change may be considered trivial and Puffin is > > still in the early stages of adoption, the Puffin spec v1 as it is > > currently written was already voted on, see: > > https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q. > > Since the change I'm proposing is converting an optional property into > > required, what this probably means is we should add this as part of a > > Puffin spec v2 (like Szehon was referring to). Thinking about this more, I > > believe this is important because going forward it should be unambiguous on > > what the practice is when making spec changes like this. > > > > What I'm now proposing: > > > > 1.) Start a vote on making the NDV property required for theta sketch > > blobs in a Puffin V2 format. Since there seems to be general support on > > this thread about making this property required, I think we can go into a > > vote, but to be very clear this requirement would apply for the next Puffin > > version. > > > > 2.) For the current PR for analyze table > > https://github.com/apache/iceberg/pull/10288 and other engine > > integrations with Puffin, they'd have to follow the spec as it is today > > which means doing a best effort read of the property, and if it does not > > exist, derive the NDV from the sketch. > > > > Please let me know your thoughts! > > > > Thanks, > > > > Amogh Jahagirdar > > > > On 2024/06/21 21:54:13 Amogh Jahagirdar wrote: > > > Hey all, > > > > > > I wanted to raise this thread to discuss a spec change proposal > > > <https://github.com/apache/iceberg/pull/10549> for making the ndv blob > > > metadata property required for theta sketches. Currently, the spec is a > > bit > > > loose stating: > > > > > > The blob metadata for this blob *may* include following properties: > > > > > > - ndv: estimate of number of distinct values, derived from the sketch > > > > > > > > > This came up on this PR > > > < > > https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> > > > where > > > it came up that engines like Presto/Trino are using the property as a > > > source of truth and the implementation of the Spark procedure in the PR > > > originally was deriving the NDV from the sketch itself. It's currently > > > unclear what engine integrations should use as a source of truth. > > > > > > The main advantage of having it in the properties is that engines don't > > > have to go and deserialize the sketch/compute the NDV if they just want > > the > > > NDV (putting aside the intersection/union case where I think engines > > would > > > have to read the sketch). I think this makes it easier for engine > > > integration. The spec also currently makes it clear that the property > > must > > > be derived from the sketch so I don't think there's a "source of truth" > > > sync concern. It also should be easy for blob writers to set this > > property > > > since they'd anyways be populating the sketch in the first place. > > > > > > An alternative is to attempt to read the property and fallback to the > > > sketch (maybe abstract this behind an API) but this loses the advantage > > of > > > guaranteeing that engines don't have to read the sketch. > > > > > > The spec change to make the property required seems to be the consensus > > on > > > the PR thread but I wanted to bring it up here in case others had > > different > > > ideas or if I'm missing any problems with this approach! > > > > > > > > > Thank you, > > > > > > Amogh Jahagirdar > > > > > > > > -- > Ryan Blue > Databricks >