manirajv06 opened a new issue, #13855: URL: https://github.com/apache/iceberg/issues/13855
### Proposed Change

Schemas evolve over time, so data files can carry different columns at different points in time. For example, data files created at T1 with schema S1 could have columns C1 to C5, while data files created at T2 with schema S2 could have columns C4 to C10, and so on. Linking a schema ID to each data file would make it easy to look up schema details for that file.

For instance, files could be filtered on whether a column exists by comparing the column's field ID with the file's max field ID. The max field ID of a file is simply the max field ID of its linked schema, which is already available and can be used directly. In the example above, C5 has the max field ID for all files linked to S1, and C10 has the max field ID for all files linked to S2.

As another example, to find out whether Parquet files contain a column of type `UnknownType`, every file currently has to be opened, because no statistics or other metadata expose this. Linking schemas to these files would make that information easy to retrieve. Other schema information could be used in the same way, depending on the requirement.

I propose linking the schema ID with data files, since it would be useful for file- and schema-related operations going forward.

### Proposal document

_No response_

### Specifications

- [ ] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
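
The two lookups described under Proposed Change (pruning files by max field ID, and finding files whose schema contains a given type) might look roughly like the sketch below. It is not Iceberg code: `Schema`, `DataFile`, `files_that_may_contain`, and `files_with_type` are hypothetical names, the type strings are placeholders, and the field IDs simply mirror the S1/S2 example (read here with S2 covering C4 to C10). Note that the max-field-ID check can only rule files out; a field ID at or below the max may still be absent from a given schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Schema:
    """Hypothetical stand-in for a table schema: field id -> type name."""
    schema_id: int
    fields: Dict[int, str]

    @property
    def max_field_id(self) -> int:
        return max(self.fields)


@dataclass
class DataFile:
    """Hypothetical data file entry carrying the proposed schema id link."""
    path: str
    schema_id: int


def files_that_may_contain(field_id: int, files: List[DataFile],
                           schemas: Dict[int, Schema]) -> List[DataFile]:
    """Drop files whose linked schema's max field id is below field_id.

    Those files cannot contain the field; the remaining files may or may not.
    """
    return [f for f in files if schemas[f.schema_id].max_field_id >= field_id]


def files_with_type(type_name: str, files: List[DataFile],
                    schemas: Dict[int, Schema]) -> List[DataFile]:
    """Find files whose linked schema has a field of the given type,
    using only schema metadata (no data file is opened)."""
    return [f for f in files
            if type_name in schemas[f.schema_id].fields.values()]


# S1 covers field ids 1..5 and S2 covers 4..10, as in the example above.
# The "unknown" type on field 10 is an assumption made for illustration.
schemas = {
    1: Schema(1, {i: "string" for i in range(1, 6)}),
    2: Schema(2, {**{i: "string" for i in range(4, 10)}, 10: "unknown"}),
}
files = [DataFile("t1.parquet", 1), DataFile("t2.parquet", 2)]

candidates = files_that_may_contain(7, files, schemas)
unknown_typed = files_with_type("unknown", files, schemas)
```

In this sketch both lookups touch only the schema map and the per-file schema ID, which is the point of the proposal: no Parquet footer has to be read to answer either question.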