emkornfield commented on issue #13855: URL: https://github.com/apache/iceberg/issues/13855#issuecomment-3229307185
Could we dive into these use cases and how often they occur? For UnknownType columns, why do we care about these specifically at the metadata level (don't we really just care that they are all null)? Which use cases have high enough schema churn that we expect a large performance boost here?

> I probably would not link schema ID because that alone would not indicate the presence of a field (Optional Fields) but we probably should have some way in the metrics of determining the difference between a "missing metric" and a missing field.

In terms of approach, I think linking schema ID probably solves 90+% of the use cases defined. If a file is written with an older schema, we can definitely say the new column is not present. But given that writers are expected to write all columns, the only case where a column is missing would be direct file imports (in that case we'd have to at least open the file to collect the metrics, which might not be cheap). If presence/absence is very important to certain writers, then ensuring we collect value count/null count as statistics is an alternative that already exists for filtering.

If we want to optimize the last 10% of the use cases, I'd say it would likely be better to link two columns:

1. Schema ID.
2. Fields in the schema that are not present in the file.

Overall, I expect 2 to be empty most of the time, and either way it scales with the exception case rather than with schema size.
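To make the two-column idea concrete, here is a minimal sketch in Python of how a reader might decide whether a field is present in a file. Everything named here (`FileEntry`, `missing_field_ids`, `field_present`, and the `schemas` mapping) is hypothetical and only illustrates the reasoning; none of it is Iceberg's actual manifest layout or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical per-file metadata carrying the two proposed columns."""
    schema_id: int
    # Fields in the write schema that the file does not contain.
    # Assumed to be empty except for cases such as direct file imports.
    missing_field_ids: frozenset = frozenset()


def field_present(entry, field_id, schemas):
    """Return True if the file is expected to contain the field.

    `schemas` maps schema ID -> set of field IDs defined by that schema
    (an assumed stand-in for the table's schema list).
    """
    if field_id not in schemas[entry.schema_id]:
        # Written with a schema that lacks the field, so it cannot be there.
        return False
    # In the schema: present unless listed as an exception.
    return field_id not in entry.missing_field_ids


# Schema 1 adds field 3 on top of schema 0.
schemas = {0: {1, 2}, 1: {1, 2, 3}}
old_file = FileEntry(schema_id=0)
new_file = FileEntry(schema_id=1)
imported_file = FileEntry(schema_id=1, missing_field_ids=frozenset({2}))
```

With this shape, the common case needs only the schema ID lookup, and the exception set is consulted (and stored) only for files that deviate from their schema.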