openinx commented on issue #2504: URL: https://github.com/apache/iceberg/issues/2504#issuecomment-824582275
@ayush-san, I think that's because we keep all the keys that arrive within the same checkpoint in an __in-memory__ HashMap. It is mainly used to locate the `<file_id, pos>` of rows that were written earlier in the current checkpoint. In the long run, we need to change this HashMap to a Map that can spill to disk, or replace it with an embedded KV library, so that we can handle a larger number of rows in a single checkpoint. [This](https://docs.google.com/presentation/d/18xL5hhGfJKEVJyv-fbfoLYWgioRMqoEutpKFDjXhyKA/edit#slide=id.gb479a3dd40_0_948) is a good document describing the current design. FYI @rdblue.
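
To make the memory behaviour concrete, here is a minimal sketch of the kind of per-checkpoint map described above. It is not the actual Iceberg code: the class `CheckpointKeyIndex`, its methods, and the use of a plain `String` key are all hypothetical names chosen for illustration. It also assumes the lookup is needed when the same key is deleted or updated again within the checkpoint, and that the map can be dropped once the checkpoint completes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; names are illustrative and are not Iceberg's API. */
class CheckpointKeyIndex {

  /** Where a row lives: the data file it was written to and its position in that file. */
  static final class RowPosition {
    final String fileId;
    final long pos;

    RowPosition(String fileId, long pos) {
      this.fileId = fileId;
      this.pos = pos;
    }
  }

  // One entry per distinct key written since the last checkpoint.
  // This is the structure that grows without bound inside a large checkpoint.
  private final Map<String, RowPosition> insertedRows = new HashMap<>();

  /** Remembers where a key was written; returns the previous position of that key, if any. */
  RowPosition recordInsert(String key, String fileId, long pos) {
    return insertedRows.put(key, new RowPosition(fileId, pos));
  }

  /**
   * Looks up and forgets a key. A non-null result means the row was written in the
   * current checkpoint, so the caller knows its exact <file_id, pos>. A null result
   * means the key was not written in this checkpoint.
   */
  RowPosition removeForDelete(String key) {
    return insertedRows.remove(key);
  }

  /** Number of keys currently held on the heap. */
  int size() {
    return insertedRows.size();
  }

  /** Assumed lifecycle: the map is cleared once the checkpoint completes. */
  void onCheckpointComplete() {
    insertedRows.clear();
  }
}
```

Because every distinct key in the checkpoint holds an entry on the heap, memory use scales with the number of rows per checkpoint. A disk-spilling Map or an embedded KV store behind the same three operations would remove that limit.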
