For the last month, I’ve been actively working on using the v2 spec in Spark. 
Specifically, my focus is to implement merge-on-read using the proposed API in 
Spark [1]. That’s why I would support the idea of adopting v2 as the current 
design is sufficient to implement considered use cases. I expect to find bugs 
and hit performance issues but I think we will be able to solve them without 
breaking the forward compatibility.

I’ll need a bit more time to dig into the NaN counts issue that Dan mentioned 
but I overall support the idea of adopting the v2 spec without making it 
default (until we prove its correctness and performance).

- Anton

[1] - https://github.com/apache/spark/pull/33008 
<https://github.com/apache/spark/pull/33008>

> On 19 Jul 2021, at 17:01, Ryan Blue <b...@tabular.io> wrote:
> 
> I'll reply inline:
> 
> On Thu, Jul 15, 2021 at 1:46 PM Daniel Weeks <dwe...@apache.org 
> <mailto:dwe...@apache.org>> wrote:
> Overall, I'm in favor of what Ryan has proposed for v2 above.  There are a 
> few points that I'm not entirely clear on, which may warrant discussion:
> 
> 1)  The discussion on resurfacing distinct_count may be something we want to 
> include since there is some value.  I think it warrants keeping and dropping 
> the deprecation (see distinct count discussion in dev list).  This actually 
> isn't a change proposal to v2, but rather signifying that it will be carried 
> forward.
> 
> I don't think that this affects v2 because it is a non-breaking change to 
> forward-compatibility. I think of this as adding it back to v1 because we can 
> add optional fields without any problem. Older writers won't produce it, but 
> they never did before it was deprecated anyway. Older readers will simply 
> ignore it when reading files.
>  
> 2)  NaN counts would now be required for field_summary and I'm not clear on 
> what that implies for tables being promoted from v1 to v2 since missing this 
> information would require evaluating all of the partition values and 
> rewriting all of the manifest lists where this info is missing (it's still 
> optional for data files).  I wasn't able to find the discussion on making 
> this required, but I assume it is so that we can accurately apply filters at 
> this level.  Is this case handled?
> 
> You might be right about not making this required.
> 
> The intent was to change just the writer requirements. While that isn't 
> strictly necessary for some fields, it helps ensure that writers follow the 
> format as it evolves. But the problem is that the caller may not have this 
> information (whether any partition tuple contained NaN for that field) when 
> writing a new manifest list file because the old (v1) manifest list file 
> didn't include it. That would be known if the manifest was rewritten, but may 
> not be if the manifest is carried forward into a new manifest list.
> 
> Making this required would mean that we need to set it to something when 
> writing. Setting it to `true` would be fine because if the field is missing, 
> we assume there may be a NaN value in a partition tuple and scan the manifest 
> when looking for NaNs. But it does seem strange to require that default so 
> I'd support removing this.
> 
> I should also mention that even though this is optional for data files, we 
> should always know when writing a manifest. That's because this is about the 
> partition tuple, which is always present when writing a manifest.
>  
> 3)  If we're signaling the formal adoption of v2 by changing the default 
> table created, do we want to do that in the next release (0.12 assuming the 
> discussion/vote closes in time) or should we separate that with a second 
> release (0.13 or 1.0) to separate all of the other changes going into the 
> next release from the official format adoption?
> 
> I don't think that we are signalling formal adoption by changing the default. 
> We are adopting by voting and will change the default when it makes sense to 
> do that based on real-world use. I think that adopting the proposed v2 format 
> means that we're confident that although there may be implementation bugs, we 
> are ready to support the new v2 format as we do v1.
> 
> The Java release and v2 format adoption are somewhat orthogonal. We could 
> release 0.12.0 and when v2 is released later state that they are compatible. 
> I think the only question is whether we find anything that we need to change 
> in 0.12.0 in order to use v2 tables. That's why I think it is good to adopt 
> the spec without changing the default. People can opt into using the new 
> tables for the new use cases and we can get some real-world use before 
> changing the default. The main consequence of adoption is that we commit to 
> supporting tables written for the v2 spec (as opposed to changing v2 in 
> incompatible ways) and we start to guarantee forward-compatibility (as 
> opposed to making more optional -> required changes).
> 
> Ryan
> 
> -- 
> Ryan Blue
> Tabular

Reply via email to