I also support making this field required. Thanks, Nick Riasanovsky
On Fri, Jun 21, 2024 at 7:00 PM Szehon Ho <szehon.apa...@gmail.com> wrote: > It makes sense to me, normally changing optional -> required would > probably require a version bump, but maybe it is ok here as it is a > relatively new format, afaik adapted by Trino which already sets this > field, but let's see if anyone disagrees. > > Thanks > Szehon > > On Fri, Jun 21, 2024 at 3:35 PM huaxin gao <huaxin.ga...@gmail.com> wrote: > >> +1 for making the ndv blob metadata property required for theta sketches. >> >> On Fri, Jun 21, 2024 at 2:54 PM Amogh Jahagirdar <2am...@gmail.com> >> wrote: >> >>> Hey all, >>> >>> I wanted to raise this thread to discuss a spec change proposal >>> <https://github.com/apache/iceberg/pull/10549> for making the ndv blob >>> metadata property required for theta sketches. Currently, the spec is a bit >>> loose stating: >>> >>> The blob metadata for this blob *may* include following properties: >>> >>> - ndv: estimate of number of distinct values, derived from the sketch >>> >>> >>> This came up on this PR >>> <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> >>> where >>> it came up that engines like Presto/Trino are using the property as a >>> source of truth and the implementation of the Spark procedure in the PR >>> originally was deriving the NDV from the sketch itself. It's currently >>> unclear what engine integrations should use as a source of truth. >>> >>> The main advantage of having it in the properties is that engines don't >>> have to go and deserialize the sketch/compute the NDV if they just want the >>> NDV (putting aside the intersection/union case where I think engines would >>> have to read the sketch). I think this makes it easier for engine >>> integration. The spec also currently makes it clear that the property must >>> be derived from the sketch so I don't think there's a "source of truth" >>> sync concern. It also should be easy for blob writers to set this property >>> since they'd anyways be populating the sketch in the first place. >>> >>> An alternative is to attempt to read the property and fallback to the >>> sketch (maybe abstract this behind an API) but this loses the advantage of >>> guaranteeing that engines don't have to read the sketch. >>> >>> The spec change to make the property required seems to be the consensus >>> on the PR thread but I wanted to bring it up here in case others had >>> different ideas or if I'm missing any problems with this approach! >>> >>> >>> Thank you, >>> >>> Amogh Jahagirdar >>> >>