Hi Iceberg Devs,

I have a process that reads tables stored in Iceberg and processes them, many at a time. Lately, we've had scalability problems caused by the number of Hadoop FileSystem objects created inside Iceberg for tables with many snapshots. These tables can have tens of thousands of snapshots, but I only want to read the latest one. The Hadoop FileSystem creation code that is called for every snapshot takes process-level locks, and those locks end up stalling my whole process.

Inside TableMetadataParser, it looks like we read in every snapshot even though the reader likely only wants one. This loop is what's responsible for locking up my process:

https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320

My process does not need the whole snapshot list. It is only interested in one particular snapshot.

I'm interested in contributing a change so that the snapshot list is computed lazily inside TableMetadata, where it's actually used. We would not create the Snapshot objects in TableMetadataParser. Instead, the parser would likely pass in SnapshotCreators that know how to create snapshots. All of the SnapshotCreators would be handed to TableMetadata, which would create snapshots only when needed.

Would you be amenable to such a change? I want to make sure this sounds like something you would accept before I spend time coding it up. Any other thoughts on this?

Thanks,
David Wilcox
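
P.S. To make the proposal above a bit more concrete, here is a rough, self-contained sketch of the idea. It does not use any real Iceberg classes. Snapshot and LazyTableMetadata are hypothetical stand-ins, and the shape of SnapshotCreator (an id plus a factory that is invoked at most once) is only my assumption about how it might look, not existing API. The counter exists only to show how many snapshots actually get built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazySnapshotsSketch {

  // Hypothetical stand-in for Iceberg's Snapshot; only the id matters here.
  static final class Snapshot {
    final long snapshotId;

    Snapshot(long snapshotId) {
      this.snapshotId = snapshotId;
    }
  }

  // Hypothetical: knows how to build one Snapshot but defers the work
  // (and any expensive FileSystem setup) until create() is first called.
  static final class SnapshotCreator {
    private final long snapshotId;
    private final Supplier<Snapshot> factory;
    private Snapshot cached;

    SnapshotCreator(long snapshotId, Supplier<Snapshot> factory) {
      this.snapshotId = snapshotId;
      this.factory = factory;
    }

    long snapshotId() {
      return snapshotId;
    }

    // Builds the snapshot on first use and caches it for later calls.
    synchronized Snapshot create() {
      if (cached == null) {
        cached = factory.get();
      }
      return cached;
    }
  }

  // Hypothetical stand-in for the snapshot-related part of TableMetadata.
  static final class LazyTableMetadata {
    private final long currentSnapshotId;
    private final Map<Long, SnapshotCreator> creatorsById = new LinkedHashMap<>();

    LazyTableMetadata(long currentSnapshotId, List<SnapshotCreator> creators) {
      this.currentSnapshotId = currentSnapshotId;
      for (SnapshotCreator creator : creators) {
        creatorsById.put(creator.snapshotId(), creator);
      }
    }

    // Materializes only the requested snapshot; returns null if the id is unknown.
    Snapshot snapshot(long snapshotId) {
      SnapshotCreator creator = creatorsById.get(snapshotId);
      return creator == null ? null : creator.create();
    }

    Snapshot currentSnapshot() {
      return snapshot(currentSnapshotId);
    }

    // Callers that really need the full list still get it, at full cost.
    List<Snapshot> snapshots() {
      List<Snapshot> all = new ArrayList<>();
      for (SnapshotCreator creator : creatorsById.values()) {
        all.add(creator.create());
      }
      return all;
    }
  }

  public static void main(String[] args) {
    AtomicInteger built = new AtomicInteger();
    List<SnapshotCreator> creators = new ArrayList<>();
    for (long id = 1; id <= 10_000; id++) {
      final long snapshotId = id;
      creators.add(new SnapshotCreator(snapshotId, () -> {
        built.incrementAndGet();
        return new Snapshot(snapshotId);
      }));
    }

    LazyTableMetadata metadata = new LazyTableMetadata(10_000L, creators);
    Snapshot latest = metadata.currentSnapshot();

    // Prints: current snapshot id = 10000, snapshots built = 1
    System.out.println(
        "current snapshot id = " + latest.snapshotId + ", snapshots built = " + built.get());
  }
}
```

In this sketch, parsing would only pay for 10,000 small creator objects, and the expensive per-snapshot work would happen once for the snapshot I actually read. Whether the creators should cache, and how they should interact with the existing snapshot accessors, are things I'd want your input on.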