Reo-LEI opened a new issue #3102:
URL: https://github.com/apache/iceberg/issues/3102


   Currently, `IcebergFilesCommitter` validates the entire snapshot history 
every time it commits a new snapshot in `commitDeltaTxn`. As a result, the same 
snapshots are verified repeatedly, and a lot of time is spent reading manifest 
files. This is why `IcebergFilesCommitter` opens multiple Avro metadata files 
and takes several minutes, as reported in 
https://github.com/apache/iceberg/issues/2900#issuecomment-895244837. 
(In more detail: Flink calls `notifyCheckpointComplete(ckptId)` immediately 
after calling `snapshotState(ckptId)`, and the committer then traverses the 
whole snapshot history to verify that the data files referenced by pos-delete 
files still exist. This blocks the committer thread and causes 
`snapshotState(ckptId+1)` to time out if HDFS responds slowly or the table has 
too many manifest files to traverse.)
   
   I think `IcebergFilesCommitter` does not need to validate the entire 
snapshot history on every commit; it only needs to validate the snapshots 
between the last committed snapshot id and the current snapshot id. For the 
first commit made by `IcebergFilesCommitter`, we still need to traverse the 
whole snapshot history to ensure that the referenced data files still exist.
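
   To make the proposal concrete, here is a minimal sketch of the range 
selection, written against a simplified model rather than Iceberg's real 
classes. The class `SnapshotRangeSketch`, the method `snapshotsToValidate`, and 
the `parentOf` map are hypothetical names introduced only for illustration. The 
sketch assumes that the snapshot at the last committed id was already validated 
by the committer, so it is excluded from the range, and that a `null` last 
committed id means this is the committer's first commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the proposal; not Iceberg's actual API.
class SnapshotRangeSketch {

  /**
   * Collects the snapshot ids that still need validation by walking the
   * parent chain backwards from currentId.
   *
   * parentOf maps a snapshot id to its parent id (null for the first snapshot).
   * lastCommittedId is the snapshot produced by the committer's previous
   * commit, or null if the committer has not committed yet (assumption).
   */
  static List<Long> snapshotsToValidate(
      Map<Long, Long> parentOf, Long currentId, Long lastCommittedId) {
    List<Long> toValidate = new ArrayList<>();
    Long id = currentId;
    // Stop at the last committed snapshot, or at the start of the history.
    while (id != null && !id.equals(lastCommittedId)) {
      toValidate.add(id);
      id = parentOf.get(id);
    }
    return toValidate;
  }

  public static void main(String[] args) {
    // History: 1 <- 2 <- 3 <- 4 (4 is the current snapshot).
    Map<Long, Long> parentOf = new HashMap<>();
    parentOf.put(1L, null);
    parentOf.put(2L, 1L);
    parentOf.put(3L, 2L);
    parentOf.put(4L, 3L);

    // First commit: no last committed id, so the whole history is traversed.
    System.out.println(snapshotsToValidate(parentOf, 4L, null)); // [4, 3, 2, 1]

    // Later commit: only snapshots newer than the last committed one.
    System.out.println(snapshotsToValidate(parentOf, 4L, 2L)); // [4, 3]

    // Nothing new since the last commit: nothing to validate.
    System.out.println(snapshotsToValidate(parentOf, 4L, 4L)); // []
  }
}
```

   Note that if the last committed id is no longer an ancestor of the current 
snapshot (for example, because it was expired), this walk falls back to the 
whole history, which matches the first-commit behaviour described above.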


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


