ConeyLiu commented on PR #4898: URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1210127293
> I think the situation would be the same even in your proposal to add new schemaid field to data_file, right? After rewriteDataFiles we have to carry over the latest schema-id of each spec, in order for your initial proposed optimization to be accurate? Because there may be data in the new file that was written by a later schema.

You are correct. After a rewrite, the data files are written with the new spec, so we cannot benefit from the schema evaluation because the original schema information is lost.

> As far as I can tell, it seems to be the right one that the manifest was written in, even after rewriteManifests.

In RewriteManifests, we use the current table partition [spec](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L103) or the spec selected by spec ID. I think the schema used by the current spec is not the same as the original schema of the old manifest file. That's because we [rewrite the partition spec](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L952) when updating the table schema. Please correct me if I am wrong.
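For context, here is a rough sketch of how the manifest rewrite is typically driven through the Spark actions API (the table path and the size threshold below are only illustrative assumptions, not from this PR). The point relevant to the discussion above is that the action plans the new manifests against the table's current partition spec, not the spec/schema the old manifests were originally written with.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RewriteManifestsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("rewrite-manifests-example")
        .getOrCreate();

    // Hypothetical table location; in practice the table would usually be loaded via a catalog.
    Table table = new HadoopTables(spark.sparkContext().hadoopConfiguration())
        .load("/path/to/warehouse/db/table");

    // Rewrite small manifests. The rewritten manifests are planned against the table's
    // *current* partition spec, so the original spec/schema association of the old
    // manifests is not carried over.
    SparkActions.get(spark)
        .rewriteManifests(table)
        .rewriteIf(manifest -> manifest.length() < 10 * 1024 * 1024) // illustrative threshold
        .execute();
  }
}
```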
