Hey all,

Thanks everyone for providing your input on this discussion thread. There are 2 
points I'd like to bring up, which in hindsight I should've recognized.

If we look at the current Iceberg Improvement Process 
https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals and 
how proposals are adopted 
https://iceberg.apache.org/contribute/#how-are-proposals-adopted, since this 
counts as spec change it should go through a formal vote process as JB alluded 
to.

Next, even though this spec change may be considered trivial and Puffin is 
still in the early stages of adoption, the Puffin spec v1 as it is currently 
written was already voted on, see: 
https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q.
Since the change I'm proposing is converting an optional property into 
required, what this probably means is we should add this as part of a Puffin 
spec v2 (like Szehon was referring to). Thinking about this more, I believe 
this is important because going forward it should be unambiguous on what the 
practice is when making spec changes like this.

What I'm now proposing:

1.) Start a vote on making the NDV property required for theta sketch blobs in 
a Puffin V2 format.  Since there seems to be general support on this thread 
about making this property required, I think we can go into a vote, but to be 
very clear this requirement would apply for the next Puffin version.

2.) For the current PR for analyze table 
https://github.com/apache/iceberg/pull/10288 and other engine integrations with 
Puffin, they'd have to follow the spec as it is today which means doing a best 
effort read of the property, and if it does not exist, derive the NDV from the 
sketch.

Please let me know your thoughts! 

Thanks,

Amogh Jahagirdar

On 2024/06/21 21:54:13 Amogh Jahagirdar wrote:
> Hey all,
> 
> I wanted to raise this thread to discuss a spec change proposal
> <https://github.com/apache/iceberg/pull/10549> for making the ndv blob
> metadata property required for theta sketches. Currently, the spec is a bit
> loose stating:
> 
> The blob metadata for this blob *may* include following properties:
> 
>    - ndv: estimate of number of distinct values, derived from the sketch
> 
> 
> This came up on this PR
> <https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695>
> where
> it came up that engines like Presto/Trino are using the property as a
> source of truth and the implementation of the Spark procedure in the PR
> originally was deriving the NDV from the sketch itself. It's currently
> unclear what engine integrations should use as a source of truth.
> 
> The main advantage of having it in the properties is that engines don't
> have to go and deserialize the sketch/compute the NDV if they just want the
> NDV (putting aside the intersection/union case where I think engines would
> have to read the sketch). I think this makes it easier for engine
> integration. The spec also currently makes it clear that the property must
> be derived from the sketch so I don't think there's a "source of truth"
> sync concern. It also should be easy for blob writers to set this property
> since they'd anyways be populating the sketch in the first place.
> 
> An alternative is to attempt to read the property and fallback to the
> sketch (maybe abstract this behind an API) but this loses the advantage of
> guaranteeing that engines don't have to read the sketch.
> 
> The spec change to make the property required seems to be the consensus on
> the PR thread but I wanted to bring it up here in case others had different
> ideas or if I'm missing any problems with this approach!
> 
> 
> Thank you,
> 
> Amogh Jahagirdar
> 

Reply via email to