Hi Amogh

+1 to have ndv blob metadata property required.

As discussed during the last community meeting, we discussed voting on
all code modification changes on spec.
For the next spec changes, I would propose to start a voting thread,
like "[VOTE][Puffin Spec] Make the ndv blob metadata property required
for theta sketches".

Thanks !
Regards
JB

On Fri, Jun 21, 2024 at 11:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>
> Hey all,
>
> I wanted to raise this thread to discuss a spec change proposal for making 
> the ndv blob metadata property required for theta sketches. Currently, the 
> spec is a bit loose stating:
>
> The blob metadata for this blob may include following properties:
>
> ndv: estimate of number of distinct values, derived from the sketch
>
>
> This came up on this PR where it came up that engines like Presto/Trino are 
> using the property as a source of truth and the implementation of the Spark 
> procedure in the PR originally was deriving the NDV from the sketch itself. 
> It's currently unclear what engine integrations should use as a source of 
> truth.
>
> The main advantage of having it in the properties is that engines don't have 
> to go and deserialize the sketch/compute the NDV if they just want the NDV 
> (putting aside the intersection/union case where I think engines would have 
> to read the sketch). I think this makes it easier for engine integration. The 
> spec also currently makes it clear that the property must be derived from the 
> sketch so I don't think there's a "source of truth" sync concern. It also 
> should be easy for blob writers to set this property since they'd anyways be 
> populating the sketch in the first place.
>
> An alternative is to attempt to read the property and fallback to the sketch 
> (maybe abstract this behind an API) but this loses the advantage of 
> guaranteeing that engines don't have to read the sketch.
>
> The spec change to make the property required seems to be the consensus on 
> the PR thread but I wanted to bring it up here in case others had different 
> ideas or if I'm missing any problems with this approach!
>
>
> Thank you,
>
> Amogh Jahagirdar

Reply via email to