Overall, I'm in favor of what Ryan has proposed for v2 above.  There are a
few points that I'm not entirely clear on, which may warrant discussion:

1)  The distinct_count discussion on the dev list suggests the field still
has some value, so I think it warrants keeping and dropping the
deprecation.  This isn't actually a change proposal for v2, but rather a
signal that the field will be carried forward.

2)  NaN counts would now be required for field_summary, and I'm not clear
on what that implies for tables being promoted from v1 to v2.  Filling in
the missing information would require evaluating all of the partition
values and rewriting every manifest list where it is absent (it's still
optional for data files).  I wasn't able to find the discussion on making
this required, but I assume it is so that we can accurately apply filters
at this level.  Is this case handled?
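
To make the concern in (2) concrete, here is a rough Python sketch of how a
reader might prune manifests for an is_nan filter using the partition field
summary.  FieldSummary and might_match_is_nan are hypothetical names for
illustration, not Iceberg APIs, and None stands in for an entry that was
written without the NaN information (my assumption about what a promoted
v1 table would look like).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSummary:
    """Hypothetical stand-in for one partition field summary in a manifest list."""

    contains_null: bool
    # None models an entry written without NaN information (assumed v1 case).
    contains_nan: Optional[bool]


def might_match_is_nan(summary: FieldSummary) -> bool:
    """Return False only when the manifest provably holds no NaN partition values."""
    if summary.contains_nan is None:
        # Unknown: the manifest cannot be skipped and has to be read.
        return True
    return summary.contains_nan


# Only the manifest with an explicit "no NaN" flag is safe to skip.
summaries = [
    FieldSummary(contains_null=False, contains_nan=None),
    FieldSummary(contains_null=False, contains_nan=False),
    FieldSummary(contains_null=True, contains_nan=True),
]
to_read = [s for s in summaries if might_match_is_nan(s)]
```

If the field were required with no fallback for the missing case, the only
way to produce a definite value for the first entry would be the rewrite
described above.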

3)  If we're signaling the formal adoption of v2 by changing the default
table created, do we want to do that in the next release (0.12, assuming
the discussion/vote closes in time), or should we do it in a second release
(0.13 or 1.0) so that the official format adoption is kept apart from all
of the other changes going into the next release?

-Dan





On Sun, Jul 11, 2021 at 5:51 PM Ryan Blue <b...@apache.org> wrote:

> Hi everyone,
>
> I’d like to start a discussion about adopting the current set of changes
> to the spec as v2. Adopting the current set of changes means that we won’t
> have the ability to include any more breaking changes in v2. Any new
> breaking change would require v3. This would not stop or affect any of the
> ongoing work to add secondary indexes or new metadata because those
> additions are forward-compatible (old readers can ignore it).
>
> The main feature that has been added to the spec is row-level deletes.
> This requires a breaking change because all readers must be updated to
> apply the deletes when reading. Existing readers don’t do that, so a new
> format version is required to ensure correctness for tables that track
> delete files. Older readers are not forward-compatible with v2 tables.
>
> Row-level delete changes include:
>
>    - Addition of equality and positional delete files
>    - Addition of content column in manifests to track the type of file
>    (data, position deletes, equality deletes)
>    - Addition of sequence_number column in manifests to track when in
>    time a data or delete file was added
>    - New rules for inheriting sequence numbers through the metadata tree
>    to keep metadata write amplification low
>
> In addition, v2 tightens requirements for additions that were compatible
> with v1. For example, in v1 we added schemas and current-schema-id to
> table metadata to allow tracking multiple versions of the table schema and
> tracking the schema that was current when a given snapshot was written. The
> table metadata fields were optional in v1 but became mandatory in v2. Also,
> the schema field is removed in v2 in favor of the new fields.
>
> Features that are supported by v2 are:
>
>    - Row-level deletes
>    - Multiple schemas, snapshot schemas
>    - Multiple partition specs, better partition evolution
>    - Table sort orders
>    - Tracking identifier fields
>    - NaN value counts for double and float fields
>    - Manifest file encryption metadata
>
> The full list of v2 changes <https://iceberg.apache.org/spec/#version-2>
> is documented at the end of the spec.
>
> I should also mention that v2 is not the default for tables in the
> reference implementation (Java). We will continue building out better
> support for row-level deletes, but I think that we are confident that the
> row-level deletes design works and isn’t going to need breaking changes.
>
> I think the next steps are to discuss the current v2 spec changes and make
> sure everyone is comfortable adopting them. If that goes well, we’ll have a
> vote to adopt the changes. Thanks for discussing, everyone!
>
> Ryan
> --
> Ryan Blue
>
