Good point, the changes are contained to the theta sketch definition in this 
case. Nothing is changing in the Puffin metadata itself in this change. I 
missed that in the spec we already versioned the blob type for example: 
https://iceberg.apache.org/puffin-spec/#apache-datasketches-theta-v1-blob-type.

Considering this, the update should probably be more granular at the sketch 
level rather than a Puffin file format upgrade. If I recall correctly, this was 
intentionally a design in Puffin so that new blob types could be added without 
necessarily needing an entire format upgrade.

Thanks,

Amogh Jahagirdar

On 2024/07/05 17:18:56 Ryan Blue wrote:
> For the compatibility and version bump question, shouldn't this be a new
> revision of the theta sketch rather than a new version of Puffin? I think
> we want to avoid making changes to the file format itself if we can contain
> the changes to the sketch definitions.
> 
> +1 for raising a vote once we have a PR with the full changes.
> 
> On Fri, Jul 5, 2024 at 9:35 AM Amogh Jahagirdar <amo...@apache.org> wrote:
> 
> > Hey all,
> >
> > Thanks everyone for providing your input on this discussion thread. There
> > are 2 points I'd like to bring up, which in hindsight I should've
> > recognized.
> >
> > If we look at the current Iceberg Improvement Process
> > https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals
> > and how proposals are adopted
> > https://iceberg.apache.org/contribute/#how-are-proposals-adopted, since
> > this counts as spec change it should go through a formal vote process as JB
> > alluded to.
> >
> > Next, even though this spec change may be considered trivial and Puffin is
> > still in the early stages of adoption, the Puffin spec v1 as it is
> > currently written was already voted on, see:
> > https://lists.apache.org/thread/950rz31y3kr3kz0zzncwokvgzbrmmz4q.
> > Since the change I'm proposing is converting an optional property into
> > required, what this probably means is we should add this as part of a
> > Puffin spec v2 (like Szehon was referring to). Thinking about this more, I
> > believe this is important because going forward it should be unambiguous on
> > what the practice is when making spec changes like this.
> >
> > What I'm now proposing:
> >
> > 1.) Start a vote on making the NDV property required for theta sketch
> > blobs in a Puffin V2 format.  Since there seems to be general support on
> > this thread about making this property required, I think we can go into a
> > vote, but to be very clear this requirement would apply for the next Puffin
> > version.
> >
> > 2.) For the current PR for analyze table
> > https://github.com/apache/iceberg/pull/10288 and other engine
> > integrations with Puffin, they'd have to follow the spec as it is today
> > which means doing a best effort read of the property, and if it does not
> > exist, derive the NDV from the sketch.
> >
> > Please let me know your thoughts!
> >
> > Thanks,
> >
> > Amogh Jahagirdar
> >
> > On 2024/06/21 21:54:13 Amogh Jahagirdar wrote:
> > > Hey all,
> > >
> > > I wanted to raise this thread to discuss a spec change proposal
> > > <https://github.com/apache/iceberg/pull/10549> for making the ndv blob
> > > metadata property required for theta sketches. Currently, the spec is a
> > bit
> > > loose stating:
> > >
> > > The blob metadata for this blob *may* include following properties:
> > >
> > >    - ndv: estimate of number of distinct values, derived from the sketch
> > >
> > >
> > > This came up on this PR
> > > <
> > https://github.com/apache/iceberg/pull/10288/files#discussion_r1622261695>
> > > where
> > > it came up that engines like Presto/Trino are using the property as a
> > > source of truth and the implementation of the Spark procedure in the PR
> > > originally was deriving the NDV from the sketch itself. It's currently
> > > unclear what engine integrations should use as a source of truth.
> > >
> > > The main advantage of having it in the properties is that engines don't
> > > have to go and deserialize the sketch/compute the NDV if they just want
> > the
> > > NDV (putting aside the intersection/union case where I think engines
> > would
> > > have to read the sketch). I think this makes it easier for engine
> > > integration. The spec also currently makes it clear that the property
> > must
> > > be derived from the sketch so I don't think there's a "source of truth"
> > > sync concern. It also should be easy for blob writers to set this
> > property
> > > since they'd anyways be populating the sketch in the first place.
> > >
> > > An alternative is to attempt to read the property and fallback to the
> > > sketch (maybe abstract this behind an API) but this loses the advantage
> > of
> > > guaranteeing that engines don't have to read the sketch.
> > >
> > > The spec change to make the property required seems to be the consensus
> > on
> > > the PR thread but I wanted to bring it up here in case others had
> > different
> > > ideas or if I'm missing any problems with this approach!
> > >
> > >
> > > Thank you,
> > >
> > > Amogh Jahagirdar
> > >
> >
> 
> 
> -- 
> Ryan Blue
> Databricks
> 

Reply via email to