The changes that were suggested were just merged:
* #2864 <https://github.com/apache/iceberg/pull/2864> made the contains_nan
field optional
* #2805 <https://github.com/apache/iceberg/pull/2805> added back
distinct_counts (not required but good to have done, too)

It looks like there's mostly consensus around the v2 spec, and it's great
that we've already seen validation in both the Flink and Spark integration.
I'll start a vote thread to adopt the spec.

Ryan

On Fri, Jul 23, 2021 at 4:37 PM Ryan Blue <b...@tabular.io> wrote:

> I'm going to revert the change to NaN tracking that makes that field
> required. I think we can make other fields required in table metadata.json
> files and manifests, but that one in the manifest list isn't a good idea.
> I'll open a PR to update it this weekend and I'll update the distinct
> counts PR from the discussion there. After that, I think we'll be ready to
> go.
>
> On Fri, Jul 23, 2021 at 4:29 PM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
>
>> For the last month, I’ve been actively working on using the v2 spec in
>> Spark. Specifically, my focus is to implement merge-on-read using the
>> proposed API in Spark [1]. That’s why I would support the idea of adopting
>> v2 as the current design is sufficient to implement considered use cases. I
>> expect to find bugs and hit performance issues but I think we will be able
>> to solve them without breaking the forward compatibility.
>>
>> I’ll need a bit more time to dig into the NaN counts issue that Dan
>> mentioned but I overall support the idea of adopting the v2 spec without
>> making it default (until we prove its correctness and performance).
>>
>> - Anton
>>
>> [1] - https://github.com/apache/spark/pull/33008
>>
>> On 19 Jul 2021, at 17:01, Ryan Blue <b...@tabular.io> wrote:
>>
>> I'll reply inline:
>>
>> On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dwe...@apache.org> wrote:
>>
>>> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are
>>> a few points that I'm not entirely clear on, which may warrant discussion:
>>>
>>> 1)  The discussion on resurfacing distinct_count may be something we
>>> want to include since there is some value.  I think it warrants keeping and
>>> dropping the deprecation (see distinct count discussion in dev list).  This
>>> actually isn't a change proposal to v2, but rather signifying that it will
>>> be carried forward.
>>>
>>
>> I don't think that this affects v2 because it is a non-breaking change to
>> forward-compatibility. I think of this as adding it back to v1 because we
>> can add optional fields without any problem. Older writers won't produce
>> it, but they never did before it was deprecated anyway. Older readers will
>> simply ignore it when reading files.
>>
>>
>>> 2)  NaN counts would now be required for field_summary and I'm not clear
>>> on what that implies for tables being promoted from v1 to v2 since missing
>>> this information would require evaluating all of the partition values and
>>> rewriting all of the manifest lists where this info is missing (it's still
>>> optional for data files).  I wasn't able to find the discussion on making
>>> this required, but I assume it is so that we can accurately apply filters
>>> at this level.  Is this case handled?
>>>
>>
>> You might be right about not making this required.
>>
>> The intent was to change just the writer requirements. While that isn't
>> strictly necessary for some fields, it helps ensure that writers follow the
>> format as it evolves. But the problem is that the caller may not have this
>> information (whether any partition tuple contained NaN for that field) when
>> writing a new manifest list file because the old (v1) manifest list file
>> didn't include it. That would be known if the manifest was rewritten, but
>> may not be if the manifest is carried forward into a new manifest list.
>>
>> Making this required would mean that we need to set it to something when
>> writing. Setting it to `true` would be fine because if the field is
>> missing, we assume there may be a NaN value in a partition tuple and scan
>> the manifest when looking for NaNs. But it does seem strange to require
>> that default so I'd support removing this.
>>
>> I should also mention that even though this is optional for data files,
>> we should always know when writing a manifest. That's because this is about
>> the partition tuple, which is always present when writing a manifest.
>>
>>
>>> 3)  If we're signaling the formal adoption of v2 by changing the default
>>> table created, do we want to do that in the next release (0.12 assuming the
>>> discussion/vote closes in time) or should we separate that with a second
>>> release (0.13 or 1.0) to separate all of the other changes going into the
>>> next release from the official format adoption?
>>>
>>
>> I don't think that we are signalling formal adoption by changing the
>> default. We are adopting by voting and will change the default when it
>> makes sense to do that based on real-world use. I think that adopting the
>> proposed v2 format means that we're confident that although there may be
>> implementation bugs, we are ready to support the new v2 format as we do v1.
>>
>> The Java release and v2 format adoption are somewhat orthogonal. We could
>> release 0.12.0 and when v2 is released later state that they are
>> compatible. I think the only question is whether we find anything that we
>> need to change in 0.12.0 in order to use v2 tables. That's why I think it
>> is good to adopt the spec without changing the default. People can opt into
>> using the new tables for the new use cases and we can get some real-world
>> use before changing the default. The main consequence of adoption is that
>> we commit to supporting tables written for the v2 spec (as opposed to
>> changing v2 in incompatible ways) and we start to guarantee
>> forward-compatibility (as opposed to making more optional -> required
>> changes).
>>
>> Ryan
>>
>> --
>> Ryan Blue
>> Tabular
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


-- 
Ryan Blue
Tabular

Reply via email to