+1 for making the ndv blob metadata property required for theta sketches.

On Fri, Jun 21, 2024 at 2:54 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hey all,
>
> I wanted to raise this thread to discuss a spec change proposal
> <https://github.com/apache/iceberg/pull/10549> for making the ndv blob
> metadata property required for theta sketches. Currently, the spec is a bit
> loose stating:
>
> The blob metadata for this blob *may* include following properties:
>
>    - ndv: estimate of number of distinct values, derived from the sketch
>
>
> This came up on this PR
> <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> 
> where
> it came up that engines like Presto/Trino are using the property as a
> source of truth and the implementation of the Spark procedure in the PR
> originally was deriving the NDV from the sketch itself. It's currently
> unclear what engine integrations should use as a source of truth.
>
> The main advantage of having it in the properties is that engines don't
> have to go and deserialize the sketch/compute the NDV if they just want the
> NDV (putting aside the intersection/union case where I think engines would
> have to read the sketch). I think this makes it easier for engine
> integration. The spec also currently makes it clear that the property must
> be derived from the sketch so I don't think there's a "source of truth"
> sync concern. It also should be easy for blob writers to set this property
> since they'd anyways be populating the sketch in the first place.
>
> An alternative is to attempt to read the property and fallback to the
> sketch (maybe abstract this behind an API) but this loses the advantage of
> guaranteeing that engines don't have to read the sketch.
>
> The spec change to make the property required seems to be the consensus on
> the PR thread but I wanted to bring it up here in case others had different
> ideas or if I'm missing any problems with this approach!
>
>
> Thank you,
>
> Amogh Jahagirdar
>

Reply via email to