emkornfield commented on issue #13855:
URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3444226377
> Yes, that's correct. We started with spec proposal to link the schema ID
> in the beginning but then changed our approach later.
Right, so my understanding is that the new metrics are only returned from
file writes (not during scans), so for this to be useful the new column does
in fact need to be persisted in the manifest file. If that understanding is
accurate, then this is a spec change (we are adding a column to the manifest
file). IMO, before we review the implementation PR there should be a design
doc and an associated proposal. The open questions on my mind are:
* Exactly what use case are we looking at? How sparse are the tables, in
  terms of all-null columns?
* Different options, with their coverage of the use cases:
  - Storing the write schema ID (why was this discarded?).
  - Storing the write schema ID with the specific set of columns not written
    for that schema (in theory we should still be writing all columns, so I'm
    not sure this makes sense).
  - Schema ID plus an additional config that stores the value count and null
    count in metrics/statistics for all columns that are completely null.
  - Storing every column written to the particular file (which seems to be
    what this PR is doing).
For storing a series of columns, what representation do we want? It seems the
main trade-off here is the expected sparsity of columns in files, so
understanding the expected requirements is important.
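
To make the sparsity trade-off concrete, here is a rough back-of-the-envelope
sketch. It is not Iceberg's actual encoding: the function names, the fixed
4-byte cost per field ID, and the 1000-column table are all assumptions made
for illustration, and a real manifest encoding (for example Avro's
variable-length integers) would give different numbers.

```python
# Hypothetical size comparison of two ways to record which columns a data
# file contains. All names and byte costs are illustrative assumptions.

def id_list_bytes(field_ids, bytes_per_id=4):
    """Size of an explicit list of field IDs, assuming a fixed cost per ID."""
    return len(field_ids) * bytes_per_id


def bitmap_bytes(max_field_id):
    """Size of a bitmap with one bit per field ID in 0..max_field_id."""
    return (max_field_id + 8) // 8  # ceil((max_field_id + 1) / 8)


# A hypothetical table whose field IDs run from 1 to 1000.
max_id = 1000

# Sparse file: only 10 columns written.
sparse_written = range(1, 11)
# Dense file: 900 columns written, 100 missing.
dense_written = range(1, 901)
dense_missing = range(901, 1001)

print("bitmap:", bitmap_bytes(max_id))
print("sparse file, list of written IDs:", id_list_bytes(sparse_written))
print("dense file, list of written IDs:", id_list_bytes(dense_written))
print("dense file, list of missing IDs:", id_list_bytes(dense_missing))
```

Under these assumptions an explicit list wins when few columns are written,
a list of missing columns wins when few are absent, and a bitmap has a fixed
cost that depends only on the highest field ID, which is why the expected
sparsity matters when picking a representation.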
Having this in a Google doc is, I think, a better avenue for collaboration
than this issue, as it allows for side comments and questions. Once there is
alignment there, proposing the spec change and then a PR implementing it seem
like the logical next steps.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]