davidwilcox opened a new issue #2130:
URL: https://github.com/apache/iceberg/issues/2130


   I have a process that reads Tables stored in Iceberg and processes them, 
many at a time. Lately, we've had scalability problems because of the number 
of Hadoop FileSystem objects that Iceberg creates for Tables with many 
snapshots. These tables can have tens of thousands of snapshots, but I only 
want to read the latest one. The Hadoop FileSystem creation code that is 
called for every snapshot takes process-level locks, and those locks end up 
stalling my whole process.
   
   Inside TableMetadataParser, it looks like we read in every snapshot even 
though the reader likely only wants one. The loop linked below is what's 
responsible for locking up my process.
   
https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
   
   My process does not care about the whole snapshot list. It is only 
interested in one particular snapshot. I'm interested in contributing a change 
so that the snapshot list is computed lazily inside TableMetadata, where it's 
actually used. We would not create the Snapshot objects in TableMetadataParser. 
Instead, the parser would likely build one SnapshotCreator per snapshot entry, 
each knowing how to create its snapshot. We would pass all of the 
SnapshotCreators into TableMetadata, which would create snapshots only when 
they are needed.
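   
   To make the idea concrete, here is a minimal sketch of what I have in mind. 
It is not Iceberg code: SnapshotStub, SnapshotCreator and LazyTableMetadata 
are hypothetical names used only for illustration, and the real Snapshot and 
TableMetadata classes have much more state than this. The point is only that 
the expensive construction runs when a snapshot is first requested.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for org.apache.iceberg.Snapshot; it only carries an id.
final class SnapshotStub {
  private final long snapshotId;

  SnapshotStub(long snapshotId) {
    this.snapshotId = snapshotId;
  }

  long snapshotId() {
    return snapshotId;
  }
}

// Hypothetical SnapshotCreator: knows its id up front, builds the snapshot on first use.
final class SnapshotCreator {
  private final long snapshotId;
  private final Supplier<SnapshotStub> factory;
  private SnapshotStub cached; // stays null until the snapshot is first requested

  SnapshotCreator(long snapshotId, Supplier<SnapshotStub> factory) {
    this.snapshotId = snapshotId;
    this.factory = factory;
  }

  long snapshotId() {
    return snapshotId;
  }

  synchronized SnapshotStub create() {
    if (cached == null) {
      cached = factory.get(); // the expensive work happens here, at most once
    }
    return cached;
  }
}

// Hypothetical slice of TableMetadata that holds creators instead of snapshots.
final class LazyTableMetadata {
  private final long currentSnapshotId;
  private final Map<Long, SnapshotCreator> creatorsById = new LinkedHashMap<>();

  LazyTableMetadata(long currentSnapshotId, List<SnapshotCreator> creators) {
    this.currentSnapshotId = currentSnapshotId;
    for (SnapshotCreator creator : creators) {
      creatorsById.put(creator.snapshotId(), creator);
    }
  }

  // Materializes only the current snapshot.
  SnapshotStub currentSnapshot() {
    return snapshot(currentSnapshotId);
  }

  // Materializes only the requested snapshot; returns null for an unknown id.
  SnapshotStub snapshot(long id) {
    SnapshotCreator creator = creatorsById.get(id);
    return creator == null ? null : creator.create();
  }

  // Callers that really need the full list still get it, paying the full cost.
  List<SnapshotStub> snapshots() {
    List<SnapshotStub> all = new ArrayList<>(creatorsById.size());
    for (SnapshotCreator creator : creatorsById.values()) {
      all.add(creator.create());
    }
    return all;
  }
}
```

   With this shape, a reader that only calls currentSnapshot() triggers one 
creator, no matter how many snapshots the table has, while snapshots() keeps 
working for callers that want everything.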
   
   Would you be amenable to such a change? I want to make sure this is 
something you would accept before I spend time coding it up.




