amogh-jahagirdar edited a comment on pull request #4052:
URL: https://github.com/apache/iceberg/pull/4052#issuecomment-1043347180


   Sorry for the delay, folks. A few updates:
   
   1.) Updated the PR with some integration tests and more unit tests.
   
   2.) The deletion batch size is configurable through s3.delete.batch-size.
   
   3.) The default is 250 instead of 1000. Tbh I think some rigorous 
benchmarking should be done here. I set it to 250 mostly to mimic the similar 
change made in hadoop-aws 
https://github.com/apache/hadoop/commit/56dee667707926f3796c7757be1a133a362f05c9
 which also used to perform batch deletions of 1000 keys until it ran into 
major throttling issues. For reference, if there are N keys in a batch, S3 
counts the call as N requests in the throughput calculation it uses for 
throttling. S3 allows 3500 write requests (PUT/COPY/POST/DELETE) per second 
per prefix. So with batches of 1000, in the worst case where most of the keys 
fall under the same prefix (e.g. somebody has a hive-like file structure), we 
would easily hit this limit. If prefixes are better distributed we could get 
more throughput, but I don't think we should rely on that assumption.
   
   4.) I do not think we need to worry about de-duping in the S3 case. If 
the same key is passed to DeleteObjects multiple times, there are no failures. 
Also, after the delete marker is set on an object, a later DeleteObjects call 
on it still does not fail (DeleteObjects does not fail if the passed-in keys 
do not exist; it's a no-op).
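   
   To make the arithmetic in 3.) concrete, here is a rough sketch of how keys 
could be split into batches and how many full batches fit under the per-prefix 
limit in the worst case. This is not the code in the PR: the class and method 
names (DeleteBatchSketch, partition, maxBatchesPerSecond) are made up for 
illustration, and the 3500 figure is simply passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not the S3FileIO implementation.
public class DeleteBatchSketch {

    // Split keys into consecutive batches of at most batchSize keys.
    // DeleteObjects accepts at most 1000 keys per request, hence the upper bound.
    public static List<List<String>> partition(List<String> keys, int batchSize) {
        if (batchSize < 1 || batchSize > 1000) {
            throw new IllegalArgumentException("batchSize must be in [1, 1000]");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    // Worst case (all keys under one prefix): each key counts as one request,
    // so this is how many full batches per second stay within the limit.
    public static int maxBatchesPerSecond(int perPrefixLimit, int batchSize) {
        return perPrefixLimit / batchSize;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1100; i++) {
            keys.add("data/part-" + i + ".parquet");
        }
        System.out.println(partition(keys, 250).size());      // 5
        System.out.println(maxBatchesPerSecond(3500, 1000));  // 3
        System.out.println(maxBatchesPerSecond(3500, 250));   // 14
    }
}
```

   With batches of 1000, only three concurrent DeleteObjects calls against a 
single prefix fit within one second's budget; with 250 there is room for 
fourteen, which leaves more headroom for other writers on the same prefix.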
   
   @dramaticlly @jackye1995 Let me know your thoughts!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


