Making the NDV property required for theta sketch blobs in Puffin

Amogh Jahagirdar Fri, 21 Jun 2024 14:54:31 -0700

Hey all,

I'd like to discuss a spec change proposal
<https://github.com/apache/iceberg/pull/10549> that makes the ndv blob
metadata property required for theta sketches. The spec is currently a bit
loose, stating:


The blob metadata for this blob *may* include following properties:

   - ndv: estimate of number of distinct values, derived from the sketch


This came up on this PR
<https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695>,
where it emerged that engines like Presto/Trino use the property as a
source of truth, while the Spark procedure in the PR originally derived the
NDV from the sketch itself. It's currently unclear which of the two engine
integrations should treat as the source of truth.

The main advantage of having it in the properties is that engines don't
have to go and deserialize the sketch/compute the NDV if they just want the
NDV (putting aside the intersection/union case where I think engines would
have to read the sketch). I think this makes it easier for engine
integration. The spec also currently makes it clear that the property must
be derived from the sketch so I don't think there's a "source of truth"
sync concern. It should also be easy for blob writers to set this property,
since they'd be populating the sketch anyway.
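
To make the writer and reader sides concrete, here is a minimal sketch in
Python rather than the actual Iceberg code. The function names are
hypothetical, and rounding the estimate to a whole number stored as a
decimal string is an assumption of this example, not something the spec
text quoted above states.

```python
from typing import Dict


def theta_blob_properties(sketch_estimate: float) -> Dict[str, str]:
    """Writer side (hypothetical helper): derive 'ndv' from the sketch estimate.

    Assumption: the estimate is rounded and stored as a decimal string.
    """
    return {"ndv": str(round(sketch_estimate))}


def require_ndv(properties: Dict[str, str]) -> int:
    """Reader side under the proposed rule: 'ndv' must be present."""
    if "ndv" not in properties:
        raise ValueError("theta sketch blob is missing the required 'ndv' property")
    return int(properties["ndv"])
```

With the property required, a reader only needs the blob metadata and never
has to deserialize the sketch to get the NDV.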

An alternative is to attempt to read the property and fall back to the
sketch (maybe abstracting this behind an API), but this loses the advantage
of guaranteeing that engines don't have to read the sketch.
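
For illustration, the fallback alternative could look roughly like the
following. This is only a sketch under assumptions: BlobMetadata and
read_ndv are hypothetical names, not Iceberg APIs, and estimate_from_sketch
stands in for whatever code would deserialize the theta sketch and compute
its estimate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

NDV_PROPERTY = "ndv"  # property name from the spec text quoted above


@dataclass
class BlobMetadata:
    """Hypothetical stand-in for a Puffin blob's metadata entry."""
    type: str
    properties: Dict[str, str] = field(default_factory=dict)


def read_ndv(metadata: BlobMetadata,
             estimate_from_sketch: Callable[[], float]) -> int:
    """Prefer the 'ndv' property; fall back to the sketch only if it is absent.

    estimate_from_sketch is a caller-supplied callback (an assumption of this
    example) and is invoked only on the fallback path.
    """
    value = metadata.properties.get(NDV_PROPERTY)
    if value is not None:
        return int(value)
    # Fallback path: the engine must read and deserialize the sketch here.
    return round(estimate_from_sketch())
```

The fallback branch is exactly the cost that making the property required
would remove: every reader still has to carry sketch-reading code for blobs
that omit the property.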

The spec change to make the property required seems to be the consensus on
the PR thread but I wanted to bring it up here in case others had different
ideas or if I'm missing any problems with this approach!


Thank you,

Amogh Jahagirdar
