emkornfield commented on issue #13855:
URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3229307185

   Could we dive into these use-cases and how often they occur? For 
UnknownType columns, why do we care about these specifically at the metadata 
level (we really just care that they are all null?)
   
   What use-cases have high enough schema churn that we expect to get a large 
performance boost here?  
   
   > I probably would not link schema ID because that alone would not indicate 
the presence of a field (Optional Fields) but we probably should have some way 
in the metrics of determining the difference between a "missing metric" and a 
missing field.
   
   In terms of approach, I think linking schema ID probably solves 90+% of 
the use case described. If a file is written with an older schema, we can 
definitely say the new column is not present. But given that writers are 
expected to write all columns, the only case where a column is missing would 
be for direct file imports (in this case we'd have to at least open the file 
to collect the metrics, which might not be cheap). If presence/absence is very 
important to certain writers, then ensuring we collect value count/null count 
as statistics is an alternative that already exists for the filtering.
   
   If we want to optimize the last 10% of the use case, I'd say it would likely 
be better to link two columns:
   1. Schema ID.
   2. Fields in the schema that are not present in the file.
   
   Overall, I expect 2 to be empty most of the time, and either way it scales 
with the exception case, rather than schema size.
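
To make the two-column idea concrete, here is a minimal sketch in Python. It is not Iceberg code: `FileEntry`, `missing_field_ids`, `field_presence`, and the `schemas` mapping are hypothetical names invented for illustration, and the sketch assumes every file records the schema ID it was written with.

```python
from dataclasses import dataclass
from enum import Enum


class Presence(Enum):
    PRESENT = "present"
    ABSENT = "absent"


@dataclass(frozen=True)
class FileEntry:
    # Hypothetical per-file metadata (not an existing Iceberg structure):
    # 1. the schema ID the file was written with
    # 2. IDs of fields in that schema which the file does not contain
    schema_id: int
    missing_field_ids: frozenset = frozenset()


def field_presence(entry, field_id, schemas):
    """Decide presence from metadata alone.

    `schemas` maps a schema ID to the set of field IDs in that schema.
    """
    if field_id not in schemas[entry.schema_id]:
        # The field is not part of the schema the file was written with,
        # e.g. the column was added after the file was written.
        return Presence.ABSENT
    if field_id in entry.missing_field_ids:
        # Exception case, e.g. a directly imported file lacking the column.
        return Presence.ABSENT
    return Presence.PRESENT


# Schema 1 adds field 3 on top of schema 0.
schemas = {0: {1, 2}, 1: {1, 2, 3}}
old_file = FileEntry(schema_id=0)
imported_file = FileEntry(schema_id=1, missing_field_ids=frozenset({2}))

print(field_presence(old_file, 3, schemas))       # Presence.ABSENT
print(field_presence(imported_file, 2, schemas))  # Presence.ABSENT
print(field_presence(imported_file, 3, schemas))  # Presence.PRESENT
```

Under these assumptions, a field reported as present but with no metrics would be a "missing metric" rather than a missing field, and the second column only grows with the number of exceptions, not with schema size.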
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
