Reo-LEI opened a new issue #3102:
URL: https://github.com/apache/iceberg/issues/3102
Currently, `IcebergFilesCommitter` validates the entire snapshot history
every time it commits a new snapshot in `commitDeltaTxn`. This means the same
snapshots are verified over and over, and a lot of time is spent reading
manifest files. This is why `IcebergFilesCommitter` needed to open many Avro
metadata files and took several minutes
in https://github.com/apache/iceberg/issues/2900#issuecomment-895244837
(In more detail: Flink calls `notifyCheckpointComplete(ckptId)` immediately
after calling `snapshotState(ckptId)`, and the committer then traverses the
entire snapshot history to verify that the data files referenced by
pos-delete files still exist. This blocks the committer thread and can make
`snapshotState(ckptId+1)` time out if HDFS responds slowly or the table has
too many manifest files to traverse.)
I think `IcebergFilesCommitter` doesn't need to validate the entire snapshot
history on every commit; it only needs to validate the snapshots between the
last committed snapshot id and the current snapshot id. For the first commit
made by `IcebergFilesCommitter`, we still need to traverse the entire snapshot
history to ensure that the referenced data files still exist.
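
A rough sketch of the idea, not the actual Iceberg code: the class `IncrementalValidation`, the `Snap` type, and the method `snapshotsToValidate` are all hypothetical stand-ins that model the snapshot history as a chain of parent pointers. The walk starts at the current snapshot and stops once it reaches the last committed snapshot; when there is no last committed snapshot (the first commit), it covers the whole history.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class IncrementalValidation {

    // Hypothetical stand-in for a table snapshot: its id and its parent's id
    // (null for the very first snapshot of the table).
    static final class Snap {
        final long id;
        final Long parentId;

        Snap(long id, Long parentId) {
            this.id = id;
            this.parentId = parentId;
        }
    }

    /**
     * Returns the ids of the snapshots that still need validation, newest first.
     * Walks parent pointers from currentId and stops (exclusive) at lastCommittedId.
     * A null lastCommittedId models the first commit: the whole history is returned.
     */
    static List<Long> snapshotsToValidate(
            Map<Long, Snap> history, Long currentId, Long lastCommittedId) {
        List<Long> result = new ArrayList<>();
        Long cursor = currentId;
        while (cursor != null && !cursor.equals(lastCommittedId)) {
            Snap snap = history.get(cursor);
            if (snap == null) {
                // The snapshot is no longer in the history (e.g. expired); stop here.
                break;
            }
            result.add(cursor);
            cursor = snap.parentId;
        }
        return result;
    }
}
```

With a history of `1 <- 2 <- 3 <- 4`, a current snapshot of 4 and a last committed snapshot of 2, only snapshots 4 and 3 are visited, so the cost of each commit depends on what changed since the previous commit rather than on the total length of the history.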
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]