For the last month, I’ve been actively working on using the v2 spec in Spark. Specifically, my focus is to implement merge-on-read using the proposed API in Spark [1]. That’s why I would support the idea of adopting v2 as the current design is sufficient to implement considered use cases. I expect to find bugs and hit performance issues but I think we will be able to solve them without breaking the forward compatibility.
I’ll need a bit more time to dig into the NaN counts issue that Dan mentioned but I overall support the idea of adopting the v2 spec without making it default (until we prove its correctness and performance). - Anton [1] - https://github.com/apache/spark/pull/33008 <https://github.com/apache/spark/pull/33008> > On 19 Jul 2021, at 17:01, Ryan Blue <b...@tabular.io> wrote: > > I'll reply inline: > > On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dwe...@apache.org > <mailto:dwe...@apache.org>> wrote: > Overall, I'm in favor of what Ryan has proposed for v2 above. There are a > few points that I'm not entirely clear on, which may warrant discussion: > > 1) The discussion on resurfacing distinct_count may be something we want to > include since there is some value. I think it warrants keeping and dropping > the deprecation (see distinct count discussion in dev list). This actually > isn't a change proposal to v2, but rather signifying that it will be carried > forward. > > I don't think that this affects v2 because it is a non-breaking change to > forward-compatibility. I think of this as adding it back to v1 because we can > add optional fields without any problem. Older writers won't produce it, but > they never did before it was deprecated anyway. Older readers will simply > ignore it when reading files. > > 2) NaN counts would now be required for field_summary and I'm not clear on > what that implies for tables being promoted from v1 to v2 since missing this > information would require evaluating all of the partition values and > rewriting all of the manifest lists where this info is missing (it's still > optional for data files). I wasn't able to find the discussion on making > this required, but I assume it is so that we can accurately apply filters at > this level. Is this case handled? > > You might be right about not making this required. > > The intent was to change just the writer requirements. While that isn't > strictly necessary for some fields, it helps ensure that writers follow the > format as it evolves. But the problem is that the caller may not have this > information (whether any partition tuple contained NaN for that field) when > writing a new manifest list file because the old (v1) manifest list file > didn't include it. That would be known if the manifest was rewritten, but may > not be if the manifest is carried forward into a new manifest list. > > Making this required would mean that we need to set it to something when > writing. Setting it to `true` would be fine because if the field is missing, > we assume there may be a NaN value in a partition tuple and scan the manifest > when looking for NaNs. But it does seem strange to require that default so > I'd support removing this. > > I should also mention that even though this is optional for data files, we > should always know when writing a manifest. That's because this is about the > partition tuple, which is always present when writing a manifest. > > 3) If we're signaling the formal adoption of v2 by changing the default > table created, do we want to do that in the next release (0.12 assuming the > discussion/vote closes in time) or should we separate that with a second > release (0.13 or 1.0) to separate all of the other changes going into the > next release from the official format adoption? > > I don't think that we are signalling formal adoption by changing the default. > We are adopting by voting and will change the default when it makes sense to > do that based on real-world use. I think that adopting the proposed v2 format > means that we're confident that although there may be implementation bugs, we > are ready to support the new v2 format as we do v1. > > The Java release and v2 format adoption are somewhat orthogonal. We could > release 0.12.0 and when v2 is released later state that they are compatible. > I think the only question is whether we find anything that we need to change > in 0.12.0 in order to use v2 tables. That's why I think it is good to adopt > the spec without changing the default. People can opt into using the new > tables for the new use cases and we can get some real-world use before > changing the default. The main consequence of adoption is that we commit to > supporting tables written for the v2 spec (as opposed to changing v2 in > incompatible ways) and we start to guarantee forward-compatibility (as > opposed to making more optional -> required changes). > > Ryan > > -- > Ryan Blue > Tabular