Hi Amogh,

+1 to making the ndv blob metadata property required.

During the last community meeting, we discussed voting on all spec changes that involve code modifications. For the next spec changes, I would propose starting a voting thread, like "[VOTE][Puffin Spec] Make the ndv blob metadata property required for theta sketches".

Thanks!

Regards
JB

On Fri, Jun 21, 2024 at 11:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
> Hey all,
>
> I wanted to raise this thread to discuss a spec change proposal for making
> the ndv blob metadata property required for theta sketches. Currently, the
> spec is a bit loose, stating:
>
> The blob metadata for this blob may include following properties:
>
> ndv: estimate of number of distinct values, derived from the sketch
>
>
> This came up on this PR, where it emerged that engines like Presto/Trino are
> using the property as a source of truth and the implementation of the Spark
> procedure in the PR originally was deriving the NDV from the sketch itself.
> It's currently unclear what engine integrations should use as a source of
> truth.
>
> The main advantage of having it in the properties is that engines don't have
> to go and deserialize the sketch/compute the NDV if they just want the NDV
> (putting aside the intersection/union case where I think engines would have
> to read the sketch). I think this makes it easier for engine integration. The
> spec also currently makes it clear that the property must be derived from the
> sketch so I don't think there's a "source of truth" sync concern. It also
> should be easy for blob writers to set this property since they'd be
> populating the sketch in the first place anyway.
>
> An alternative is to attempt to read the property and fall back to the sketch
> (maybe abstract this behind an API) but this loses the advantage of
> guaranteeing that engines don't have to read the sketch.
>
> The spec change to make the property required seems to be the consensus on
> the PR thread but I wanted to bring it up here in case others had different
> ideas or if I'm missing any problems with this approach!
>
>
> Thank you,
>
> Amogh Jahagirdar