openinx opened a new pull request #2680:
URL: https://github.com/apache/iceberg/pull/2680


   Currently, the 
[insertedRowMap](https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java#L110)
 in 
[BaseEqualityDeltaWriter](https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/core/src/main/java/org/apache/iceberg/io/BaseTaskWriter.java#L95)
 is an in-memory hash map, so the writer can easily run out of memory (OOM) 
when the data set is even slightly larger than the memory allocated by the task 
manager. For example, when migrating a full snapshot from a MySQL table to an 
Apache Iceberg table, the existing data set in the MySQL table can be quite 
large, yet all of those rows are exported within the same Flink checkpoint, so 
an OOM is likely.
   
   This patch tries to provide a map backed by an embedded RocksDB instance, 
so that rows can be spilled to disk once they exceed a given threshold. The 
patch is still a work in progress and needs more test cases before it is ready 
for review.
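
   To illustrate the spill-to-disk idea, here is a minimal sketch. It is not 
the code in this patch: `SpillableMap`, `maxInMemory` and `spilledCount` are 
hypothetical names, keys and values are plain strings, and an append-only temp 
file with an on-heap offset index stands in for the embedded RocksDB store so 
that the example needs only the JDK.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: keeps at most maxInMemory entries on the heap and
 * writes the values of all further keys to a temp file.
 * Assumption: null values are not supported.
 */
class SpillableMap implements AutoCloseable {
  private final int maxInMemory;
  private final Map<String, String> memory = new HashMap<>();
  // Stand-in for RocksDB: append-only log file plus a key -> offset index.
  private final Map<String, Long> diskIndex = new HashMap<>();
  private final File file;
  private final RandomAccessFile log;

  SpillableMap(int maxInMemory) throws IOException {
    this.maxInMemory = maxInMemory;
    this.file = File.createTempFile("spillable-map", ".log");
    this.file.deleteOnExit();
    this.log = new RandomAccessFile(file, "rw");
  }

  void put(String key, String value) throws IOException {
    boolean alreadyInMemory = memory.containsKey(key);
    boolean fitsInMemory = !diskIndex.containsKey(key) && memory.size() < maxInMemory;
    if (alreadyInMemory || fitsInMemory) {
      memory.put(key, value);
      return;
    }
    // Threshold reached (or key already spilled): append the value to the log
    // and remember where it starts. Older versions of the value become garbage.
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    long offset = log.length();
    log.seek(offset);
    log.writeInt(bytes.length);
    log.write(bytes);
    diskIndex.put(key, offset);
  }

  String get(String key) throws IOException {
    String inMemory = memory.get(key);
    if (inMemory != null) {
      return inMemory;
    }
    Long offset = diskIndex.get(key);
    if (offset == null) {
      return null; // key is unknown
    }
    log.seek(offset);
    byte[] bytes = new byte[log.readInt()];
    log.readFully(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  int size() {
    return memory.size() + diskIndex.size();
  }

  int spilledCount() {
    return diskIndex.size();
  }

  @Override
  public void close() throws IOException {
    log.close();
    file.delete();
  }
}
```

   Note that this stand-in still keeps every spilled key on the heap, so it 
only helps when values dominate the footprint. An embedded key-value store 
such as RocksDB would hold both keys and values outside the Java heap, which 
is presumably why the patch uses it.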


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



