Hey all, Thanks everyone for providing your input on this discussion thread. There are 2 points I'd like to bring up, which in hindsight I should've recognized.
If we look at the current Iceberg Improvement Process https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals and how proposals are adopted https://iceberg.apache.org/contribute/#how-are-proposals-adopted, since this counts as spec change it should go through a formal vote process as JB alluded to. Next, even though this spec change may be considered trivial and Puffin is still in the early stages of adoption, the Puffin spec v1 as it is currently written was already voted on, see: https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q. Since the change I'm proposing is converting an optional property into required, what this probably means is we should add this as part of a Puffin spec v2 (like Szehon was referring to). Thinking about this more, I believe this is important because going forward it should be unambiguous on what the practice is when making spec changes like this. What I'm now proposing: 1.) Start a vote on making the NDV property required for theta sketch blobs in a Puffin V2 format. Since there seems to be general support on this thread about making this property required, I think we can go into a vote, but to be very clear this requirement would apply for the next Puffin version. 2.) For the current PR for analyze table https://github.com/apache/iceberg/pull/10288 and other engine integrations with Puffin, they'd have to follow the spec as it is today which means doing a best effort read of the property, and if it does not exist, derive the NDV from the sketch. Please let me know your thoughts! Thanks, Amogh Jahagirdar On 2024/06/21 21:54:13 Amogh Jahagirdar wrote: > Hey all, > > I wanted to raise this thread to discuss a spec change proposal > <https://github.com/apache/iceberg/pull/10549> for making the ndv blob > metadata property required for theta sketches. Currently, the spec is a bit > loose stating: > > The blob metadata for this blob *may* include following properties: > > - ndv: estimate of number of distinct values, derived from the sketch > > > This came up on this PR > <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> > where > it came up that engines like Presto/Trino are using the property as a > source of truth and the implementation of the Spark procedure in the PR > originally was deriving the NDV from the sketch itself. It's currently > unclear what engine integrations should use as a source of truth. > > The main advantage of having it in the properties is that engines don't > have to go and deserialize the sketch/compute the NDV if they just want the > NDV (putting aside the intersection/union case where I think engines would > have to read the sketch). I think this makes it easier for engine > integration. The spec also currently makes it clear that the property must > be derived from the sketch so I don't think there's a "source of truth" > sync concern. It also should be easy for blob writers to set this property > since they'd anyways be populating the sketch in the first place. > > An alternative is to attempt to read the property and fallback to the > sketch (maybe abstract this behind an API) but this loses the advantage of > guaranteeing that engines don't have to read the sketch. > > The spec change to make the property required seems to be the consensus on > the PR thread but I wanted to bring it up here in case others had different > ideas or if I'm missing any problems with this approach! > > > Thank you, > > Amogh Jahagirdar >