Ahh. This is better. I hadn't gotten any emails from anyone on this list earlier! This is refreshing!
Yes. I made a change myself, but then noticed, and discussed with people inside Adobe (Gautam included), that your change, Ryan, fixed my problem. Thanks for that! 😄

I also filed the issue here: https://github.com/apache/iceberg/issues?q=is%3Aissue

As far as I'm concerned, we can consider this issue solved. Thanks!

________________________________
From: Gautam <gautamkows...@gmail.com>
Sent: Tuesday, January 26, 2021 12:56 PM
To: Iceberg Dev List <dev@iceberg.apache.org>; Ryan Blue <rb...@netflix.com>
Cc: Gautam Kowshik <kows...@adobe.com>; Xabriel Collazo Mojica <xcoll...@adobe.com>; Grp-XAD <x...@adobe.com>; David Wilcox <dawil...@adobe.com>
Subject: Re: Ways To Alleviate Load For Tables With Many Snapshots

+ dawilcox

On Tue, Jan 26, 2021 at 11:46 AM Gautam <gautamkows...@gmail.com> wrote:

Hey Ryan & David,

I believe this change from you [1] indirectly achieves this. David's issue is that every table.load() is instantiating one FS handle for each snapshot, and in your change, by converting the File reference into a location string, this is already a lazy read (in a way?). The version David has been testing with was before this change. I believe with the change in [1] the FS handles issue should be resolved. Please correct me if I'm wrong, David/Ryan.

Thanks and regards,
-Gautam.
[1] - https://github.com/apache/iceberg/pull/1085/files

On Tue, Jan 26, 2021 at 10:55 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

David,

We could probably make it so that Snapshot instances are lazily created from the metadata file, but that would be a fairly large change. If you're interested, we can definitely make it happen.

I agree with Vivekanand, though. A much easier solution is to reduce the number of snapshots in the table by expiring them. How long are you retaining snapshots?

rb

On Thu, Jan 21, 2021 at 8:11 PM Vivekanand Vellanki <vi...@dremio.com> wrote:

Just curious, what is the need to retain all those snapshots?

I would assume that there is a mechanism to expire snapshots and delete data/manifest files that are no longer required.

On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <dawil...@adobe.com.invalid> wrote:

Hi Iceberg Devs,

I have a process that reads Tables stored in Iceberg and processes them, many at a time. Lately, we've had problems with the scalability of our process due to the number of Hadoop Filesystem objects created inside Iceberg for Tables with many snapshots. These tables could have tens of thousands of snapshots inside, but I only want to read the latest snapshot. Inside the Hadoop Filesystem creation code that's called for every snapshot, there are process-level locks that end up locking up my whole process.

Inside TableMetadataParser, it looks like we read in every snapshot even though the reader likely only wants one snapshot. This loop is what's responsible for locking up my process.
https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320

I noticed that my process does not care about the whole snapshot list. My process is only interested in a particular snapshot -- just one of them. I'm interested in making a contribution so that the entire snapshot list is lazily calculated inside of TableMetadata where it's actually used.

So, we would not create the Snapshot itself in TableMetadataParser, but instead would likely pass in a SnapshotCreator that knows how to create snapshots. We would pass all of the SnapshotCreators into TableMetadata, which would create snapshots when needed.

Would you be amenable to such a change? I want to make sure that you think that this sounds like something you would accept before I spend time coding it up.

Any other thoughts on this?

Thanks,
David Wilcox

--
Ryan Blue
Software Engineer
Netflix
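
For readers of the archive, here is a minimal sketch of the lazy-creation idea David describes in his original message: the parser hands over one creator per snapshot, and a snapshot object is only built when someone asks for it. This is an illustration under stated assumptions, not Iceberg's implementation. Every class name here (SketchSnapshot, LazySnapshot, SketchTableMetadata, LazySnapshotDemo) is hypothetical, and the "SnapshotCreator" from the thread is modeled as a plain java.util.function.Supplier.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for a snapshot; NOT the real org.apache.iceberg.Snapshot. */
final class SketchSnapshot {
  final long id;
  final String manifestListLocation;

  SketchSnapshot(long id, String manifestListLocation) {
    this.id = id;
    this.manifestListLocation = manifestListLocation;
  }
}

/** Wraps a creator (the thread's "SnapshotCreator") and builds the snapshot on first use. */
final class LazySnapshot implements Supplier<SketchSnapshot> {
  private final Supplier<SketchSnapshot> creator;
  private SketchSnapshot value; // stays null until get() is first called

  LazySnapshot(Supplier<SketchSnapshot> creator) {
    this.creator = creator;
  }

  @Override
  public synchronized SketchSnapshot get() {
    if (value == null) {
      value = creator.get(); // the expensive work happens here, at most once
    }
    return value;
  }
}

/** Hypothetical metadata holder that keeps creators instead of eagerly built snapshots. */
final class SketchTableMetadata {
  private final long currentSnapshotId;
  private final Map<Long, LazySnapshot> snapshotsById;

  SketchTableMetadata(long currentSnapshotId, Map<Long, LazySnapshot> snapshotsById) {
    this.currentSnapshotId = currentSnapshotId;
    this.snapshotsById = snapshotsById;
  }

  /** Returns the requested snapshot, or null if the id is unknown. */
  SketchSnapshot snapshot(long id) {
    LazySnapshot lazy = snapshotsById.get(id);
    return lazy == null ? null : lazy.get();
  }

  SketchSnapshot currentSnapshot() {
    return snapshot(currentSnapshotId);
  }
}

class LazySnapshotDemo {
  public static void main(String[] args) {
    AtomicInteger created = new AtomicInteger(); // counts how many snapshots get built
    Map<Long, LazySnapshot> byId = new LinkedHashMap<>();
    for (long id = 1; id <= 10_000; id++) {
      final long snapshotId = id;
      byId.put(id, new LazySnapshot(() -> {
        created.incrementAndGet();
        // The location string is made up for the example.
        return new SketchSnapshot(snapshotId, "metadata/snap-" + snapshotId + ".avro");
      }));
    }

    SketchTableMetadata metadata = new SketchTableMetadata(10_000L, byId);
    SketchSnapshot current = metadata.currentSnapshot();
    // Only the snapshot that was asked for has been materialized.
    System.out.println("current=" + current.id + " created=" + created.get());
  }
}
```

The trade-off matches Ryan's comment: the table still carries one entry per snapshot, so expiring old snapshots remains the simpler way to keep load cheap.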