ConeyLiu commented on PR #4898:
URL: https://github.com/apache/iceberg/pull/4898#issuecomment-1210127293

   > I think the situation would be the same even in your proposal to add a new 
schema-id field to data_file, right? After rewriteDataFiles we have to carry 
over the latest schema-id of each spec in order for your initially proposed 
optimization to be accurate, because there may be data in the new file that was 
written by a later schema.
   
   You are correct. After a rewrite, the data file is written with the new 
spec, so we cannot benefit from schema evaluation because the original schema 
information is lost.
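
   To make the point above concrete, here is a toy sketch. It is not Iceberg 
code: `SchemaIdSketch`, `ToyDataFile`, `rewrite`, and `mayContainColumn` are 
hypothetical names, and it assumes that schema IDs only grow over time and 
that a column added in schema N cannot have values in a file written with an 
earlier schema. Under those assumptions, a rewritten file has to carry the 
latest schema ID of its sources, so pruning stays correct but no longer helps 
for the rows that came from the older file.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are part of Iceberg's API.
final class SchemaIdSketch {

  // A simplified data file that remembers the schema ID it was written with.
  static final class ToyDataFile {
    final String path;
    final int schemaId;

    ToyDataFile(String path, int schemaId) {
      this.path = path;
      this.schemaId = schemaId;
    }
  }

  // Compacts several files into one. The output carries the highest source
  // schema ID, because some of its rows may have been written by that schema.
  static ToyDataFile rewrite(String newPath, List<ToyDataFile> sources) {
    if (sources.isEmpty()) {
      throw new IllegalArgumentException("at least one source file is required");
    }
    int latest = sources.get(0).schemaId;
    for (ToyDataFile file : sources) {
      latest = Math.max(latest, file.schemaId);
    }
    return new ToyDataFile(newPath, latest);
  }

  // Assumed pruning rule: a file written before the column existed can be skipped.
  static boolean mayContainColumn(ToyDataFile file, int columnAddedInSchemaId) {
    return file.schemaId >= columnAddedInSchemaId;
  }

  public static void main(String[] args) {
    ToyDataFile oldFile = new ToyDataFile("a.parquet", 0);
    ToyDataFile newFile = new ToyDataFile("b.parquet", 2);
    ToyDataFile merged = rewrite("merged.parquet", Arrays.asList(oldFile, newFile));

    // Before the rewrite, the old file can be skipped for a column added in schema 2.
    System.out.println("old file may contain column: " + mayContainColumn(oldFile, 2));
    // After the rewrite, the merged file cannot be skipped for the same column.
    System.out.println("merged file may contain column: " + mayContainColumn(merged, 2));
  }
}
```

   In this sketch the old file alone is prunable, but the merged file is not, 
which is the loss of schema information described above.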
   
   > As far as I can tell, it seems to be the right one that the manifest was 
written in, even after rewriteManifests. 
   
   In RewriteManifests, we use the current table partition 
[spec](https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteManifestsSparkAction.java#L103)
 or the spec specified by spec ID. I think the schema used in the current 
spec is not the same as the original schema for the old manifest file. That is 
because we [rewrite the partition 
spec](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L952)
 when updating the table schema. Please correct me if I am wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


