Hey all, I wanted to raise this thread to discuss a spec change proposal <https://github.com/apache/iceberg/pull/10549> for making the ndv blob metadata property required for theta sketches. Currently, the spec is a bit loose stating:
The blob metadata for this blob *may* include following properties: - ndv: estimate of number of distinct values, derived from the sketch This came up on this PR <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> where it came up that engines like Presto/Trino are using the property as a source of truth and the implementation of the Spark procedure in the PR originally was deriving the NDV from the sketch itself. It's currently unclear what engine integrations should use as a source of truth. The main advantage of having it in the properties is that engines don't have to go and deserialize the sketch/compute the NDV if they just want the NDV (putting aside the intersection/union case where I think engines would have to read the sketch). I think this makes it easier for engine integration. The spec also currently makes it clear that the property must be derived from the sketch so I don't think there's a "source of truth" sync concern. It also should be easy for blob writers to set this property since they'd anyways be populating the sketch in the first place. An alternative is to attempt to read the property and fallback to the sketch (maybe abstract this behind an API) but this loses the advantage of guaranteeing that engines don't have to read the sketch. The spec change to make the property required seems to be the consensus on the PR thread but I wanted to bring it up here in case others had different ideas or if I'm missing any problems with this approach! Thank you, Amogh Jahagirdar