kbendick commented on issue #2033:
URL: https://github.com/apache/iceberg/issues/2033#issuecomment-758538209


   I have been thinking about this, and I have some questions related to the 
bucket being used for versioning.
   
   It's not an uncommon situation to have a versioned S3 bucket that has no lifecycle policy to remove expired object deletion markers or to expire non-current versions. By default, versioned buckets do not have these. In such a situation, it's not uncommon to end up with either a very large number of object deletion markers or a single key with a very high number of versions (sometimes in the millions), which can greatly affect your S3 throughput.
   
   I have personally encountered this issue when using a versioned bucket with Flink (without using Iceberg) for storing checkpoint and savepoint data for jobs. For Flink, it's typical for the job manager to delete checkpoints depending on how many are configured to be retained. With regular checkpointing, it's very easy to then accumulate a very large number of object deletion markers that are never expired. Additionally, it's not uncommon to set up a Flink job to checkpoint to a bucket where much of the data shares a very similar key prefix (and therefore likely winds up in the same physical partition). For example, when using a per-job cluster, where the job ids are always 0000000000000000, it's easy to have your checkpoint and savepoint data wind up with a long, consistent prefix in the key name (Flink provides a configuration to add randomness wherever desired in the checkpoint path).
   
   Additionally, I know that the RocksDB state backend in Flink writes to a `/shared` directory when using incremental checkpointing, and I have observed that directory grow pretty much indefinitely. At my work we have special logic in place to remove this folder when a valid savepoint is taken (amongst other criteria).
   
   **TLDR**: For versioned S3 buckets, particularly for `PUT` and `DELETE` requests, the likelihood of getting a 503-slow down response increases quite a lot, due to the problem of many object versions / a very large number of retained object deletion markers, and per-partition throughput limitations. **When using Apache Flink without a policy in place to aggressively remove expired object versions and object deletion markers, it's not uncommon in my experience to run into 503-slow down issues.**
   
   **What you can do to debug this issue**: First and foremost, if you have access to the console (or if you're the one managing the bucket), I'd be sure that when enabling versioning the required lifecycle policies are in place. That means expiring noncurrent versions, removing object deletion markers, and removing stale / failed parts of multipart uploads (more on that below). Something you can do to debug your current bucket, without having to create additional buckets just for testing etc., is to enable logging for your S3 bucket. You can enable basic server access logs, which do not have added cost beyond the writes to the S3 bucket, according to the instructions here: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/server-access-logging.html. Additionally, you can enable lifecycle logging and check for the relevant lifecycle events in CloudTrail to see what's happening with versions in your bucket: https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-and-other-bucket-config.html
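   
   For reference, turning on server access logging is roughly this with boto3 (just a sketch; the bucket names here are placeholders, and the target log bucket needs the usual S3 log-delivery permissions already configured):
   
   ```python
   import boto3
   
   # Sketch only: "my-iceberg-bucket" and "my-s3-access-logs" are placeholder names.
   # Enables server access logging for the data bucket, writing log objects
   # into a separate log bucket under a per-bucket prefix.
   s3 = boto3.client("s3")
   s3.put_bucket_logging(
       Bucket="my-iceberg-bucket",
       BucketLoggingStatus={
           "LoggingEnabled": {
               "TargetBucket": "my-s3-access-logs",
               "TargetPrefix": "my-iceberg-bucket/",
           }
       },
   )
   ```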
   
   You can read about it at the bottom of this page: https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html. I've quoted the relevant part below. It does not mention it, but not having a lifecycle policy in place to remove expired object deletion markers will also cause this issue (I believe the underlying cause is that it affects HEAD requests, which are needed for both PUT and DELETE on versioned buckets).
   
   ```
   If you notice a significant increase in the number of HTTP 503-slow down 
responses received for Amazon S3 PUT or DELETE object requests to a bucket that 
has S3 Versioning enabled, you might have one or more objects in the bucket for 
which there are millions of versions. For more information, see Troubleshooting 
Amazon S3.
   ```
   
   I would also be interested in any error logs you have, @elkhand, as Iceberg retries requests but will eventually error out, so the final errors would be helpful.
   
   Have you tested this using a fresh bucket, with no preexisting object keys? And did any transactions ever complete once versioning was enabled, or did the failures only start after some time? Additionally, have you observed this issue with a bucket that started its life as a versioned bucket (or at least did not have any non-versioned keys in it)? I've also encountered instances where versioning was enabled on a bucket after the fact, and a large number of old objects remained in the bucket indefinitely because I'd forgotten to remove them.
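   
   To get a feel for whether a handful of keys have accumulated an excessive number of versions or deletion markers, something along these lines can help (a rough sketch with boto3; the bucket name is a placeholder):
   
   ```python
   import boto3
   from collections import Counter
   
   # Sketch only: "my-iceberg-bucket" is a placeholder.
   # Counts object versions and delete markers per key so you can spot
   # keys that have accumulated an excessive number of either.
   s3 = boto3.client("s3")
   versions = Counter()
   delete_markers = Counter()
   
   paginator = s3.get_paginator("list_object_versions")
   for page in paginator.paginate(Bucket="my-iceberg-bucket"):
       for v in page.get("Versions", []):
           versions[v["Key"]] += 1
       for m in page.get("DeleteMarkers", []):
           delete_markers[m["Key"]] += 1
   
   print("Keys with the most versions:", versions.most_common(10))
   print("Keys with the most delete markers:", delete_markers.most_common(10))
   ```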
   
   
   Lastly, and perhaps _most importantly_, here is the documentation on lifecycle rules: https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html. I personally have experienced issues when writing Flink savepoints and checkpoints to versioned S3 buckets, mostly because of the high frequency with which Flink can create and delete objects, combined with not having the proper lifecycle rules to expire old versions, remove object deletion markers, and remove failed in-flight multipart uploads (parts of a multipart upload that never successfully completed). While there's not exactly a definitive way for the bucket to know if an upload has failed, it's common to simply decide on a large enough time frame after which to remove the parts of a multipart upload that has not completed. I typically use 24 hours or even 7 days; the most important thing is just having the policy in place, which AWS does not add by default.
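   
   Putting those three pieces together, a lifecycle configuration along these lines covers noncurrent version expiration, expired deletion marker cleanup, and aborting incomplete multipart uploads (again just a sketch with boto3; the bucket name is a placeholder and the retention periods are values you'd want to tune):
   
   ```python
   import boto3
   
   # Sketch only: bucket name and retention periods are placeholders to tune.
   s3 = boto3.client("s3")
   s3.put_bucket_lifecycle_configuration(
       Bucket="my-iceberg-bucket",
       LifecycleConfiguration={
           "Rules": [
               {
                   # Permanently delete noncurrent object versions after 7 days.
                   "ID": "expire-noncurrent-versions",
                   "Status": "Enabled",
                   "Filter": {"Prefix": ""},
                   "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
               },
               {
                   # Clean up expired object deletion markers and abandon
                   # multipart uploads that haven't completed within 7 days.
                   "ID": "cleanup-markers-and-multipart",
                   "Status": "Enabled",
                   "Filter": {"Prefix": ""},
                   "Expiration": {"ExpiredObjectDeleteMarker": True},
                   "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
               },
           ]
       },
   )
   ```
   
   The same rules can of course be created from the S3 console or via the AWS CLI instead; what matters is that all three actions exist on the bucket.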
   
   If one does not explicitly add these policies to the bucket, these objects and their metadata will remain forever and severely impact S3 performance on versioned buckets. Additionally, there is a cost associated with storing all of this useless (or potentially useless) data, such as very old object versions, as AWS still bills you for it. So it's extra important to ensure that these policies are all in place.
   
   Without error logs, I'm not sure I can be of much more help. But I've been thinking about this issue recently and thought I'd add my personal experience with using versioned S3 buckets with Flink. I have been able to use Flink to read and write checkpoint data as well as data files to versioned S3 buckets, so I don't personally think that alone is the issue. However, I have experienced a lot of headaches when writing to versioned buckets without aggressive S3 lifecycle policies in place to remove files, and without a separate process for removing files from the `/shared` directory for RocksDB incremental checkpoints stored on S3. Accumulating a large number of objects (which includes different object versions as well as object deletion markers) is relatively easy to do with Flink without a good lifecycle policy in place.
   
   Finally, it might also be important to note that once you've enabled versioning on a bucket, it can never technically be reverted to a non-versioned bucket. When you turn versioning off, the bucket technically becomes a version-suspended S3 bucket. This means that your old files, including object deletion markers and non-current object versions, still exist, and the change only applies to objects written going forward. So if you've enabled versioning for some time and then turned it off, it's important to ensure that any unneeded non-current object versions / object deletion markers are removed.
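   
   If you're not sure which state a bucket is currently in, a quick check looks roughly like this (sketch with boto3; the bucket name is a placeholder):
   
   ```python
   import boto3
   
   # Sketch only: "my-iceberg-bucket" is a placeholder.
   # The response "Status" is "Enabled" or "Suspended"; the key is absent
   # entirely if versioning has never been enabled on the bucket.
   s3 = boto3.client("s3")
   resp = s3.get_bucket_versioning(Bucket="my-iceberg-bucket")
   print(resp.get("Status", "Never enabled"))
   ```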

