openinx opened a new pull request #2680: URL: https://github.com/apache/iceberg/pull/2680
Currently, the [insertedRowMap](https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java#L110) in [BaseEqualityDeltaWriter](https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java#L95) is an in-memory hash map, so the writer can easily run out of memory (OOM) when the data set is even slightly larger than the memory the task manager provides. For example, when migrating a full snapshot from a MySQL table to an Apache Iceberg table, the existing data set in the MySQL table can be quite large, yet all of those rows are exported within the same Flink checkpoint, so an OOM is likely.

In this patch, we are trying to provide a map backed by an embedded RocksDB, so that rows can be spilled to disk once the map exceeds a given threshold.

The patch is still a work in progress and needs more test cases before it is ready for review.
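
To illustrate the spill-on-threshold idea described above, here is a minimal sketch. It is not the code from this patch: the class name `SpillableMap`, its methods, and the `maxInMemoryEntries` parameter are all hypothetical, and a plain temporary file stands in for the embedded RocksDB so that the example needs only the JDK. Keys and values are assumed to be non-null strings. Note that this simplified version still keeps the keys of spilled entries in memory, which a RocksDB-backed map would avoid.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: a temp file stands in for the embedded RocksDB.
class SpillableMap implements AutoCloseable {
  private final int maxInMemoryEntries;
  // Entries that fit under the threshold stay here.
  private final Map<String, String> memory = new HashMap<>();
  // For spilled entries, only the key and the file offset of the value stay in memory.
  private final Map<String, Long> spilledOffsets = new HashMap<>();
  private final File spillFile;
  private final RandomAccessFile spill;

  SpillableMap(int maxInMemoryEntries) {
    this.maxInMemoryEntries = maxInMemoryEntries;
    try {
      this.spillFile = File.createTempFile("spillable-map", ".bin");
      this.spill = new RandomAccessFile(spillFile, "rw");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  void put(String key, String value) {
    boolean alreadyInMemory = memory.containsKey(key);
    boolean newKeyFits = !spilledOffsets.containsKey(key) && memory.size() < maxInMemoryEntries;
    if (alreadyInMemory || newKeyFits) {
      memory.put(key, value);
      return;
    }
    // Threshold exceeded: append a length-prefixed record to the spill file.
    // An overwritten spilled value simply leaves its old record behind as garbage.
    try {
      byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
      long offset = spill.length();
      spill.seek(offset);
      spill.writeInt(bytes.length);
      spill.write(bytes);
      spilledOffsets.put(key, offset);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  String get(String key) {
    String value = memory.get(key);
    if (value != null) {
      return value;
    }
    Long offset = spilledOffsets.get(key);
    if (offset == null) {
      return null;
    }
    try {
      spill.seek(offset);
      byte[] bytes = new byte[spill.readInt()];
      spill.readFully(bytes);
      return new String(bytes, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  String remove(String key) {
    String value = get(key);
    memory.remove(key);
    spilledOffsets.remove(key);
    return value;
  }

  int size() {
    return memory.size() + spilledOffsets.size();
  }

  int spilledCount() {
    return spilledOffsets.size();
  }

  @Override
  public void close() {
    try {
      spill.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      spillFile.delete();
    }
  }
}
```

With `new SpillableMap(2)`, the first two distinct keys stay in the hash map and a third is written to the spill file; `get` and `remove` behave the same for both kinds of entry. A real implementation would also need to serialize the writer's actual key and value types rather than strings.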
