emkornfield commented on issue #13855:
URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3444226377

   > Yes, that's correct. We started with spec proposal to link the schema ID 
in the beginning but then changed our approach later.
   
   Right, so my understanding is that the new metrics are only returned 
from file writes (not during scans), so in order for this to be useful, the 
new column does in fact need to be persisted in the manifest file.  If this is an 
accurate understanding, then this is a spec change (we are adding a column to the 
manifest file).   IMO, before we review the implementation PR 
there should be a design doc and an associated proposal.  In general, the open 
questions on my mind are:
   *  Exactly what use case we are looking at.  How sparse are the tables, in 
terms of all-null columns? 
   *  Different options with coverage for the use cases:
      - Storing write Schema ID (why was this discarded?)
      - Storing write Schema ID with the specific set of columns not written for 
that schema (in theory we should still be writing all columns, so I'm not sure 
this makes sense).
      - Schema ID + an additional config that stores the value count and 
null count in Metrics/statistics for all columns that are completely null
      - Storing every column written to the particular file (this seems to be what 
the PR is doing).
   
   For storing a series of columns, what representation do we want? It seems the 
main trade-off here depends on the expected sparsity of columns in files, so 
understanding the expected requirements is important.
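   
   To make the trade-off concrete, here is a rough back-of-the-envelope sketch 
comparing three ways a file entry could record which columns were written. 
The three encodings, the function names, and the 4-byte field ID size are all 
assumptions for illustration; none of them come from the Iceberg spec or from 
the PR, and real manifest encodings (e.g. Avro varints) would change the numbers.

```python
# Hypothetical sketch: these encodings and sizes are illustrative assumptions,
# not anything defined by the Iceberg spec or the implementation PR.


def encoding_sizes(total_columns: int, written_columns: int, id_bytes: int = 4) -> dict[str, int]:
    """Approximate per-file bytes for three ways to record written columns."""
    if not 0 <= written_columns <= total_columns:
        raise ValueError("written_columns must be between 0 and total_columns")
    return {
        # Explicit list of field IDs that were written to the file.
        "written_ids": written_columns * id_bytes,
        # Explicit list of field IDs that were NOT written to the file.
        "omitted_ids": (total_columns - written_columns) * id_bytes,
        # One bit per column in the schema, rounded up to whole bytes.
        "bitmap": (total_columns + 7) // 8,
    }


def cheapest(total_columns: int, written_columns: int) -> str:
    """Name of the smallest encoding for the given sparsity."""
    sizes = encoding_sizes(total_columns, written_columns)
    return min(sizes, key=sizes.get)


if __name__ == "__main__":
    for written in (5, 500, 990):
        print(written, encoding_sizes(1000, written), cheapest(1000, written))
```

   Under these assumptions, with 1000 columns a file that writes only 5 of them 
favors listing the written IDs, a file that writes 990 favors listing the omitted 
IDs, and anything in between favors the bitmap. That is why the expected 
sparsity needs to be pinned down before picking a representation.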
   
   Having this in a Google Doc is, I think, a better avenue for collaboration than 
this issue, as it allows for side comments/questions.  Once there is alignment 
there, proposing the spec change and then a PR implementing it seem like the 
logical next steps.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

