I also support making this field required.

Thanks,
Nick Riasanovsky

On Fri, Jun 21, 2024 at 7:00 PM Szehon Ho <szehon.apa...@gmail.com> wrote:

> It makes sense to me, normally changing optional -> required would
> probably require a version bump, but maybe it is ok here as it is a
> relatively new format, afaik adapted by Trino which already sets this
> field, but let's see if anyone disagrees.
>
> Thanks
> Szehon
>
> On Fri, Jun 21, 2024 at 3:35 PM huaxin gao <huaxin.ga...@gmail.com> wrote:
>
>> +1 for making the ndv blob metadata property required for theta sketches.
>>
>> On Fri, Jun 21, 2024 at 2:54 PM Amogh Jahagirdar <2am...@gmail.com>
>> wrote:
>>
>>> Hey all,
>>>
>>> I wanted to raise this thread to discuss a spec change proposal
>>> <https://github.com/apache/iceberg/pull/10549> for making the ndv blob
>>> metadata property required for theta sketches. Currently, the spec is a bit
>>> loose stating:
>>>
>>> The blob metadata for this blob *may* include following properties:
>>>
>>>    - ndv: estimate of number of distinct values, derived from the sketch
>>>
>>>
>>> This came up on this PR
>>> <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695> 
>>> where
>>> it came up that engines like Presto/Trino are using the property as a
>>> source of truth and the implementation of the Spark procedure in the PR
>>> originally was deriving the NDV from the sketch itself. It's currently
>>> unclear what engine integrations should use as a source of truth.
>>>
>>> The main advantage of having it in the properties is that engines don't
>>> have to go and deserialize the sketch/compute the NDV if they just want the
>>> NDV (putting aside the intersection/union case where I think engines would
>>> have to read the sketch). I think this makes it easier for engine
>>> integration. The spec also currently makes it clear that the property must
>>> be derived from the sketch so I don't think there's a "source of truth"
>>> sync concern. It also should be easy for blob writers to set this property
>>> since they'd anyways be populating the sketch in the first place.
>>>
>>> An alternative is to attempt to read the property and fallback to the
>>> sketch (maybe abstract this behind an API) but this loses the advantage of
>>> guaranteeing that engines don't have to read the sketch.
>>>
>>> The spec change to make the property required seems to be the consensus
>>> on the PR thread but I wanted to bring it up here in case others had
>>> different ideas or if I'm missing any problems with this approach!
>>>
>>>
>>> Thank you,
>>>
>>> Amogh Jahagirdar
>>>
>>

Reply via email to